The first goal of this research was to create a software-based voice conversion
system  to  independently  and  automatically  modify  the  characteristics  of  human  voice. 
The  system  was  intended  to  generate  high  quality  test  tokens  for  speech  science  and 
psychoacoustic  studies.  The  second  goal  was  to  develop  algorithms  to  convert  voice  from 
one  speaker  to  that  of  another  speaker.  The  results  of  this  study  will  be  of  interest  to 
researchers  in  speech  analysis,  speech  synthesis  and  speaker  identification. 

The key ideas for our voice conversion system are based on the source-tract
production model, which is a highly parametric representation for speech analysis and
synthesis. The software system consists of three subsystems, a speech analyzer, a
parameter modifier and a speech synthesizer, which respectively extract, modify and
synthesize five types of acoustic features. The features are the formant frequencies and
bandwidths, the shape of the glottal pulse, the voice type classification, the pitch contour
and the gain contour. The first two types of parameters are frame-based, and they
represent the speaker's characteristics of the vocal tract and the vocal folds, respectively.
The final three parameters form the controlling parameters for our system. One major
feature of our acoustic model is that the controlling parameters are independent of the
other parameters, so they control how the frame-based information is concatenated,
for example when changing the speaking rate or increasing the voice volume. This
makes it possible to mimic the characteristics of another speaker's voice, including the
prosodic features.

The  voice  conversion  algorithms  are  based  on  a  speaker  adaptation  model  that 
treats  speaker  differences  as  arising  from  a  parametric  transformation.  The  voice 
conversion task is then realized as the mapping between two sets of parameters. Several
experiments were conducted to test the performance of our voice conversion algorithms.
The affine transformation method proved to be effective for converting single-syllable
words, but less so for sentences. Perhaps this is because a sentence contains more local
dynamic changes than a single linear mapping can capture. One possible improvement
is to include a phoneme detector in the system and to estimate piecewise mapping
functions instead of one linear function for the entire utterance.

CHAPTER 1
INTRODUCTION 


1.1  Speech  Synthesis  and  Voice  Conversion 

Speech is perhaps the most distinctive capability of the human species. For years,
engineers and scientists have conducted extensive research on it and invented hundreds of
products. The speech signal is a complex acoustic event that not only conveys linguistic
content, but also provides information about the speaker's identity and such personal
characteristics as emotional state, age, gender, dialect, and state of health.
Therefore, synthetic speech should not only be intelligible, but should also convey
these additional aspects of the "synthetic" speaker. The aim of this research is to investigate
the factors which affect the quality of synthetic speech in a voice conversion system.

A  voice  conversion  system  consists  of  three  functional  subsystems:  a  speech 
analyzer,  a  parameter  modifier,  and  a  speech  synthesizer.  Without  the  parameter  modifier, 
the  rest  works  as  a  normal  speech  synthesis  system  to  generate  a  synthetic  replica  of  a 
speech  signal.  With  the  parameter  modifier,  the  speech  content  is  to  remain  the  same  but 
the  "voice"  is  to  be  altered  or  modified  to  sound  like  that  of  another  speaker.  In  contrast 
to  typical  speech  generation,  e.g.,  a  creation  process  "from  text-to-speech  by  rules," 
voice  conversion  is  a  task  that  can  be  defined  as  "speech-to-speech  conversion"  with 
emphasis  on  the  controlling  of  voice  quality  with  suitable  parameters.  Thus,  this 
dissertation  is  concerned  with  developing  a  system  for  voice  conversion  that  can  be  used 
to investigate hypotheses concerning factors affecting speech quality.

1.2 Review of Previous Research

1.2.1  Speech  Production  Mechanism  and  Synthesis 

Figure  1-1  (a)  illustrates  a  simplified  schematic  diagram  of  the  vocal  apparatus. 
Speech is the acoustic wave that is radiated from the lips when air is expelled from the
lungs and the resulting flow of air is perturbed by a constriction somewhere in the vocal
tract.

In the late 1950s Fant (1959) proposed the well-known source-filter model to
simulate speech production. In his scheme, speech production is divided into two
serial  blocks,  namely,  the  excitation  source,  which  models  the  waveform  of  the  air  flow, 
and  the  acoustic  modulation,  which  shapes  the  excitation  spectrum  to  form  various 
sounds.  Furthermore,  these  two  blocks  are  linearly  connected  and  operate  independently 
and  non-interactively.  Figure  1-1  (b)  shows  a  general  block  diagram  of  such  a  model. 
The  vocal  tract  and  radiation  effects  are  accounted  for  by  the  time-varying  linear  system. 
Its  purpose  is  to  model  the  acoustic  modulation.  The  excitation  generation,  which 
simulates  the  quasi-periodic  air  streams  and  turbulence,  is  the  energy  source  of  the 
system.  The  parameters  of  the  source  and  the  linear  filter  are  chosen  so  that  the  resulting 
output  has  the  desired  speech-like  properties. 

1.2.2  Three  Factors  in  Characterizing  Voice 

One  way  to  look  at  voice  conversion  is  to  consider  such  a  process  as  a  voice 
personality  transformation,  i.e.,  the  process  of  converting  one  person's  voice  to  sound  like 
that  of  another.  From  the  speech  production  model  point  of  view,  there  are  three  main 
factors  that  contribute  to  vocal  characteristics. 

One  factor  is  related  to  physiology.  The  overall  dimensions  of  the  vocal  tract  as 
well  as  the  relative  proportions  of  the  supra-glottal  cavities  (laryngeal,  oral,  and  nasal)  are 
factors that affect speaker vocal characteristics. For a uniform lossless tube of length L,
closed at one end and open at the other, the resonances of the tube, F_n, can be written as

$$F_n = (2n - 1)\,\frac{c}{4L} \tag{1-1}$$

where n is a positive integer and c is the speed of sound. Since a formant is defined as a
resonance of the vocal tract, this equation identifies the formant frequencies of a
closed-open uniform vocal tract (Fant, 1960; Titze, 1994). As demonstrated by Eq.
(1-1), the variation of the dimensions of the vocal apparatus influences vowel formant
frequencies, thereby causing the formant frequencies and bandwidths to vary from
speaker to speaker.

Figure 1-1. Speech production model.
(a) Schematic diagram of vocal apparatus;
(b) Source-filter model of speech production.

A second factor is also closely related to physiology. The variation of the
dimension and tension of the vocal folds leads to various vocal fold vibrating patterns.
For illustration, assume the vocal folds to be equivalent to piano or violin
strings: uniform, thin, under constant tension, fixed anteriorly and posteriorly, and free to
move otherwise. Furthermore, the strings are driven by plucking or stroking. The
fundamental, F_0, is then predictable by the well-known formula for vibrating strings
(Benade, 1976):

$$F_0 = \frac{1}{2L}\sqrt{\frac{\sigma}{\rho}} \tag{1-2}$$

Here L is the length of the vocal folds, σ is the longitudinal stress in vocal fold tissue, and
ρ is the tissue density. Based on this formula, the fundamental frequency is inversely
proportional to the vocal fold length and directly proportional to the square root of tissue
stress. As also described by Ishizaka and Flanagan (1972) in their studies of the glottal
flow, variations in the dimensions of the subglottal apparatus affect the pulse width,
pulse skewness, abruptness of closure, and the spectral tilt of the glottal pulse.

A third factor is linked to the dynamics of speech production. Each speaker has
developed his/her own articulatory skill to produce the various phonemes of their
language. Speaking habits are influenced by dialect and social environment, and may
lead to prosodic variations such as intonation, stress, and duration in the context of
speech.
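
As a rough numerical illustration of Eqs. (1-1) and (1-2), the following sketch evaluates both formulas. The function names and all numerical values (speed of sound, tract length, vocal fold length, stress and density) are assumptions chosen only for illustration; they are not measurements from this study.

```python
import math


def tube_resonances(length_cm, n_resonances=3, c_cm_per_s=35000.0):
    """Resonances F_n = (2n - 1) * c / (4L) of a closed-open uniform tube, Eq. (1-1)."""
    return [(2 * n - 1) * c_cm_per_s / (4.0 * length_cm)
            for n in range(1, n_resonances + 1)]


def string_fundamental(length_m, stress_pa, density_kg_m3):
    """Fundamental F_0 = (1 / (2L)) * sqrt(sigma / rho) of an ideal string, Eq. (1-2)."""
    return math.sqrt(stress_pa / density_kg_m3) / (2.0 * length_m)


# Assumed tract lengths: a 17.5 cm tube versus a shorter 14 cm tube.
print(tube_resonances(17.5))   # [500.0, 1500.0, 2500.0]
print(tube_resonances(14.0))   # [625.0, 1875.0, 3125.0]

# Assumed vocal fold values: 16 mm length, 15 kPa stress, 1040 kg/m^3 density.
print(string_fundamental(0.016, 15000.0, 1040.0))
```

Shortening the tube raises every resonance by the same ratio, which is the sense in which vocal tract length shifts the formants from speaker to speaker. Likewise, halving the string length doubles F_0, while doubling F_0 through stress alone requires four times the stress.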

1.2.3  Voice  Conversion 

A  voice  conversion  system  should  simulate  the  variations  caused  by  the  three 
factors  mentioned  before.  This  task  is  beyond  the  capability  of  current  technology  and 
knowledge,  especially  the  simulation  of  changes  in  prosodic  strategy.  Most  previous 
researchers  have  focused  on  the  modifications  of  the  segmental  parameters  of  speech 
models.  In  other  words,  a  transformation  is  performed  frame  by  frame  by  mapping  the 
acoustic  space  of  one  speaker  onto  the  acoustic  space  of  another.  In  particular,  previous 
researchers  concentrated  on  the  variations  of  the  vocal  tract  and  the  fundamental 
frequency  of  voicing. 

In the early 1980s, Kuwabara (1984) constructed a speech analysis/synthesis
system with the capability to independently manipulate the formant frequencies and
bandwidths of voiced speech. The system was based on the linear prediction model of
speech production with the residual signal as the input to the synthesis filter. However,
this research gave no specific rule for transferring parameters from one acoustic space to
another. A similar analysis-synthesis system was developed by Wu (1985) with the
focus on male/female conversion.

Abe  et  al.  (1988)  studied  a  voice  conversion  system  as  shown  in  Figure  1-2.  In 
order  to  simulate  a  spectrum  variation,  this  research  proposed  the  use  of  vector 
quantization  with  codebook  mapping.  The  basic  idea  was  to  develop  codebooks  that 
represent  the  various  speakers'  vocal  characteristics.  Then  codebooks  for  mapping  the 
spectrum  parameters,  power  values,  and  pitch  frequencies  were  developed  using  a  training 
utterance.  Speech  was  then  modified  by  applying  these  mapping  codebooks  and  finally 
synthesized using a linear predictive coding (LPC) vocoder.

Figure 1-2. The voice conversion system used by Abe et al. (1988) and Savic
and Nam (1991).

A similar approach was developed by Savic and Nam (1991) with the codebook
mapping  being  replaced  by  a  multilayer  neural  network.  These  results  are  encouraging. 
However,  the  speech  quality  obtained  was  limited  due  to  the  use  of  a  standard  LPC 
vocoder. 

Another  approach  was  presented  by  Valbret  et  al.  (1992),  who  replaced  the 
excitation source of the LPC vocoder by the Pitch-Synchronous Overlap-and-Add
(PSOLA) residue algorithm to improve the speech quality. A simple prosodic
modification scheme was also introduced by them as shown in Figure 1-3. The time axis
of  the  source  was  first  warped  in  order  to  align  it  with  the  target  speech  signal.  The 
excitation  signal  of  this  synthesizer  was  the  aligned  PSOLA  residue  of  the  source.  Besides 
modifying  the  excitation  source,  they  tested  and  compared  two  methods  for  deriving 
spectral  mapping:  linear  multivariable  regression  (LMR)  and  dynamic  frequency  warping 
(DFW).  Even  though  the  resulting  synthetic  speech  sounded  natural,  it  did  not  sound  like 
the  target  speaker's  voice. 

In  sum,  a  flexible,  user-friendly,  and  high  quality  voice  conversion  system  has  not 
yet  been  developed.  This  is  one  of  the  primary  objectives  of  this  research. 

1.3  Research  Goals  and  Plan 

1.3.1  Research  Goals 

Ideally  a  voice  conversion  system  should  simulate  the  variations  caused  by  the 
three factors mentioned in Section 1.2.2 and have the ability to produce speech with
any desired vocal quality. In the proposed system, we adopt the following methodology
to achieve this task. In analysis we collect three sets of speech parameters, one for the
vocal tract, one for the voiced source, and one for the prosodic feature (i.e., intonation,
the contour of fundamental frequency), and in synthesis we create a replica of the signal
with these sets of parameters. Most importantly, we simulate the process of voice
conversion by mapping the parameters from one acoustic space to another in the
transformation process.

Figure 1-3. PSOLA scheme for prosodic modification by Valbret et al. (1992).

We hypothesize that these sets of parameters are closely related to the three basic factors
of speech production, respectively, and that each set of parameters can be manipulated
independently. To be specific, suppose that we have collected the analysis parameters
for two utterances with the same content spoken by two different speakers. Our
methods for voice conversion are then as follows:

1.  The  parameter  set  for  the  linear  filter  represents  the  characteristics  of  the  vocal 
tract.  Therefore,  the  mapping  between  these  two  vectors  simulates  the 
conversion  of  the  length  of  vocal  tract  between  two  speakers. 

2. The parameter set for the glottal source represents the characteristics of the
vocal folds. Therefore, the vector distance between two sets of this
parameter  type  accounts  for  the  dimension  variation  of  vocal  folds  between 
different  speakers. 

3.  The  pitch  contour  represents  the  speech's  intonation.  Therefore,  the 
conversion  from  one  pitch  contour  to  another  accounts  for  the  intonation 
transformation. 
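
The abstract reports that an affine transformation was used to map one speaker's parameters onto another's. The sketch below shows one way such a mapping could be estimated from time-aligned parameter frames by ordinary least squares. It is an illustration under stated assumptions, not the algorithm developed later in this work: the function names, the use of `numpy.linalg.lstsq`, and the frame layout (one row per frame, one column per parameter) are all hypothetical.

```python
import numpy as np


def fit_affine_map(source_frames, target_frames):
    """Least-squares fit of target ~= source @ A + b for time-aligned frames.

    Both inputs have shape (n_frames, n_params). Returns (A, b).
    """
    X = np.asarray(source_frames, dtype=float)
    Y = np.asarray(target_frames, dtype=float)
    # Append a column of ones so the offset b is estimated together with A.
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
    return W[:-1], W[-1]


def apply_affine_map(frames, A, b):
    """Map source-speaker frames into the target speaker's parameter space."""
    return np.asarray(frames, dtype=float) @ A + b
```

A single pair (A, b) is fitted for the whole utterance here, which corresponds to the "one linear function for the entire speech" case mentioned in the abstract. A piecewise variant would fit one pair per phoneme class.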

From this standpoint, the main objectives and emphasis of the research are
itemized below:

1. To build a voice conversion system with the following features:

• Flexibility. Two types of configuration, namely, the formant and the LP, are
used in this system, and each set of synthetic parameters can be manipulated
independently.

• User-friendliness. The software package of this system is implemented with a
graphical user interface. Users can simulate the process by clicking
the mouse.

• High quality. The synthetic speech should maintain the same quality level
as an advanced LP synthesizer or formant synthesizer if the system is
operated without the parameter modifier.

2.  To  develop  a  data  base  for  the  parameters  of  several  voice  types. 

3.  To  establish  a  mapping  function  between  the  parameters  of  the  source  speaker 
and  those  of  the  target  speaker. 

4.  To  investigate  the  factors  involved  in  the  simulation  of  voice  conversion. 

Note that the research is limited to voiced speech, since an accurate and reliable
algorithm to find the vocal tract parameters (formants) for unvoiced speech has not yet
been developed.

1.3.2  Research  Plan 

The research plan, as shown in Figure 1-4, was carried out in the following three
phases:

1.  The  calibration  phase.  Two  successful  speech  analysis/synthesis  systems  (Hu, 
1993;  Shue,  1995)  were  modified  and  integrated  for  use  with  the  proposed 
voice  conversion  system. 

2.  The  training  phase.  Algorithms  for  parameter  transformations  were  developed 
during this phase. The data for the acoustic parameters of the source speech
and the target speech were collected, analyzed, and aligned by dynamic
time warping.

3.  The  verification  phase.  Two  voice  conversion  systems,  one  based  on  the 
formant  synthesizer  and  the  other  on  the  LP  synthesizer,  were  evaluated  and 
compared  via  an  informal  listening  test.  The  goal  was  to  investigate  the 
factors  relevant  to  the  synthesis  of  high  quality  speech  with  desired  vocal 
characteristics. The task was accomplished by statistical analysis of the data.

Figure 1-4. Block diagram of the proposed research plan.
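
The training phase aligns source and target parameter tracks by dynamic time warping. A minimal textbook sketch of that alignment is given below. It assumes a Euclidean local distance and the basic three-way step pattern (diagonal, vertical, horizontal), since the constraints actually used in this work are not specified in this chapter. The function names are illustrative only.

```python
import numpy as np


def _as_frames(x):
    """Return x as a 2-D array with one row per frame."""
    x = np.asarray(x, dtype=float)
    return x.reshape(-1, 1) if x.ndim == 1 else x


def dtw_align(source, target):
    """Align two frame sequences; return (total_cost, path of index pairs)."""
    src, tgt = _as_frames(source), _as_frames(target)
    n, m = len(src), len(tgt)
    # Euclidean distance between every source frame and every target frame.
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=2)

    # acc[i, j] = cost of the best path aligning src[:i] with tgt[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])

    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin((acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return acc[n, m], path
```

Each pair `(i, j)` in the returned path says that source frame `i` corresponds to target frame `j`, so the paired frames can be used directly as training data for a parameter mapping.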


1.4  Description  of  Chapters 

Chapter  2  introduces  our  voice  conversion  system  which  consists  of  three 
subsystems,  a  speech  analyzer,  a  parameter  modifier  and  a  speech  synthesizer.  Five  types 
of acoustic features are extracted, modified and synthesized in the corresponding
subsystems.  There  are  two  types  of  implementations  for  this  system:  the  formant 
configuration  and  the  LP  configuration.  We  briefly  review  the  basic  concept  of  the 
formant  and  LP  synthesis  schemes,  and  thereafter  discuss  their  implementation.  The 
analysis  procedures  as  well  as  the  synthesis  procedures  will  also  be  presented. 

Chapter  3  continues  the  development  of  the  voice  conversion  system.  We  focus  on 
the  parameter  modifier  that  modifies  the  analyzed  parameters  of  the  speech  signal  in  order 
to  synthesize  a  desired  voice.  In  particular,  we  are  interested  in  modifying  the  pitch 
contour,  the  gain  contour,  the  resonant  tract,  and  the  glottal  pulse  shape  of  the  speech 
signal.  The  techniques  as  well  as  the  associated  problems  arising  from  the 
implementation  will  be  discussed  in  detail. 

Chapter  4  presents  our  approach  to  convert  voice  from  one  speaker  to  another 
speaker.  This  research  is  one  model  for  studying  factors  responsible  for  the  quality  of 
synthetic  speech  and  for  the  speaker  normalization.  We  introduce  two  adaptive  models  to 
describe  the  differences  between  two  sets  of  parameters.  Then  our  voice  conversion 
algorithms  are  developed  based  on  this  parameter  transformation  platform.  We  describe 
several  experiments  to  test  the  performance  of  our  voice  conversion  algorithms.  This 
chapter  is  concluded  by  discussing  the  experimental  results. 

A flexible, user-friendly, high quality voice conversion system is implemented as
a  research  tool.  Chapter  5  describes  the  software  features  and  the  design  concepts. 
Finally,  Chapter  6  summarizes  the  conclusions  and  describes  future  work  to  improve  the 
voice  conversion  system. 


CHAPTER  2 
VOICE  CONVERSION  SYSTEM 


Our  voice  conversion  system  consists  of  three  subsystems,  a  speech  analyzer,  a 
parameter  modifier  and  a  speech  synthesizer,  as  illustrated  in  Figure  2-1.  The  speech 
signal is the input to the analyzer, which provides three sets of speech parameters for our
acoustic model: the excitation source parameters, the excitation control parameters and
the resonant tract parameters. These sets of parameters can be independently altered or
modified in the parameter modifier. Using those parameters, a synthetic replica of the
speech  signal  is  created  according  to  the  specified  speech  production  model.  Without  the 
parameter  modifier,  the  voice  conversion  system  works  as  a  basic  speech 
analysis/synthesis  system.  The  purpose  of  this  chapter  is  to  describe  the  mechanism  of  the 
system,  as  well  as  the  procedures  to  estimate  the  speech  parameters  and  synthesize  the 
speech.  The  algorithms  for  modifying  the  speech  parameters,  the  core  part  of  a  voice 
conversion  system,  will  be  developed  in  the  next  chapter. 

Our  main  contributions  to  the  design  of  this  system  are  (1)  the  flexibility 
incorporated  in  the  system  configuration,  which  consists  of  two  excitation  models  and  two 
resonant  tract  models;  (2)  the  independent  control  algorithms  for  the  three  sets  of  speech 
parameters  and  (3)  a  means  for  combining  the  excitation  and  the  resonant  tract  without 
further  interpolation  of  the  parameters  during  synthesis. 

2.1  Acoustic  Models  of  Speech  Production 

The key ideas in our analysis/synthesis system are based on the source-filter
production model proposed by Fant (1959). This model has been widely used for
generating high quality speech as well as for studying the acoustic aspects of speech
production (Klatt and Klatt; Childers and Hu, 1995). Theoretically, the speech production
process is divided into three basic functions: 1) the source filter, 2) the vocal tract filter,
and 3) the radiation filter, as illustrated in Figure 2-2 (a). The effect of the radiation filter
is usually included in the excitation source since these three filters are assumed to be
linearly connected.

Figure 2-1. The voice conversion system.

In synthesis, the excitation control parameters, such as the voiced/unvoiced
classification, the pitch information and the excitation gain, have to be known in advance
for generating the excitation pulse. Therefore, our acoustic model of speech production
consists of three submodels, the excitation control model, the excitation source model and
the resonant tract model, as shown in Figure 2-2 (b). The speech is synthesized as
follows:  the  excitation  control  model  produces  the  control  information  to  trigger  the 
source  model  to  generate  the  excitation  pulse,  and  the  pulse  is  then  filtered  by  the  resonant 
tract  model. 
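
The control, source and tract sequence just described can be sketched as follows. This is an illustration under stated assumptions: the excitation pulse is a generic unit-energy bump standing in for the source model (it is not the glottal waveform model used in this work), the tract is written in the all-pole form V(z) = 1 / (1 - sum_k a_k z^-k), and all function names are hypothetical.

```python
import numpy as np


def placeholder_pulse(length):
    """Unit-energy stand-in for one excitation period (not the glottal model)."""
    pulse = np.hanning(length + 2)[1:-1]   # strictly positive bump, `length` samples
    return pulse / np.sqrt(np.sum(pulse ** 2))


def build_excitation(gci, gains):
    """Control model output -> excitation: one scaled pulse per pitch period.

    gci   : glottal closure instants in samples (period boundaries)
    gains : one gain value per pitch period
    """
    gci = np.asarray(gci, dtype=int)
    excitation = np.zeros(gci[-1])
    for start, stop, gain in zip(gci[:-1], gci[1:], gains):
        excitation[start:stop] = gain * placeholder_pulse(stop - start)
    return excitation


def all_pole_filter(excitation, a):
    """Resonant tract: y(n) = x(n) + sum_k a[k-1] * y(n-k)."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * y[n - k]
        y[n] = acc
    return y
```

Because the pitch period boundaries and gains are supplied separately from the pulse shape and the filter coefficients, changing the timing or loudness does not require touching the frame-based source and tract parameters.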

Since  the  speech  production  is  represented  by  these  submodels,  the  speech  signal 
can  be  represented  by  the  parameters  of  the  corresponding  submodels.  Namely,  the 
acoustic  characteristics  of  the  speech  signal  are  highly  parameterized  by  these  submodels 
such  that  the  speech  features  can  be  transformed  to  the  acoustic  domain  required  for  the 
new synthetic voice. The algorithms for modifying speech features will be developed in
detail in the next chapter.

2.1.1  Excitation  Control  Model 

One major feature of our acoustic model of speech production is that the
controlling parameters are independent of the source model and the tract model. The
source and tract parameters are frame-based information, while the excitation control
parameters control how that frame-based information is concatenated, for example by
speeding up or slowing down. There are three types of parameters determined by the
control model: the voiced/unvoiced classification, the excitation gain and the excitation
waveform length (called the pitch period). The purpose of the excitation control model is
to supply the necessary information for the source model to generate a suitable excitation
pulse. In that sense, the excitation control model is an analog of the part of the human
nervous system that controls the muscles of the vocal cords. The relationship between the
excitation control model and the excitation source model is illustrated in Figure 2-3.

Figure 2-2. (a) Schematic diagram of a speech production model.
(b) Simplified diagram of our speech production model.

Speech  sounds  may  be  classified  as  either  voiced  or  unvoiced.  This  requires  that 
the  source  model  produce  either  a  quasi-periodic  pulse  waveform  or  a  random  noise 
waveform  to  excite  the  resonant  tract.  In  our  system,  a  waveform  model  is  used  to 
generate  quasi-periodic  pulses  for  voiced  sounds,  while  a  256-entry  stochastic  codebook 
(Hu, 1993) is used for unvoiced speech synthesis. The voiced/unvoiced classification
determines which source model is used to generate the excitation waveform.

For unvoiced excitation, due to the lack of an appropriate criterion for
characterizing performance, we empirically code the residue in segments of 5 ms duration
(50 samples at 10 kHz). For voiced excitation, the length of the excitation waveform is
determined by the pitch period, which is the duration between successive glottal closure
instants (GCIs). The GCI sequence is a parameter that belongs to the excitation control
model and provides the synchronous timing information for the speech synthesis process.

The  pitch  period  (i.e.,  the  reciprocal  of  the  fundamental  frequency)  is  an  important 
acoustic  feature  for  assessing  individual  differences  in  voice  quality.  One  important 
feature of this parameter is that the pitch period varies throughout an utterance. The plot
of  the  pitch  period  for  an  utterance  is  called  the  pitch  contour,  which  reveals  the  attitudes 
and  feelings  of  the  speaker  in  ways  the  segmental  information  alone  can  never  do. 
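
The relation between the GCI sequence, the pitch period and the pitch contour can be made concrete with a short sketch. The function names and the example GCI positions are assumptions for illustration. The 10 kHz sampling rate and the 5 ms unvoiced segment follow the values quoted in the text.

```python
import numpy as np


def pitch_contour_from_gci(gci_samples, fs=10000.0):
    """Return (pitch periods in seconds, fundamental frequency in Hz).

    Each pitch period is the time between two successive glottal closure instants.
    """
    gci = np.asarray(gci_samples, dtype=float)
    periods_s = np.diff(gci) / fs
    f0_hz = 1.0 / periods_s
    return periods_s, f0_hz


def unvoiced_frame_length(duration_s=0.005, fs=10000.0):
    """Number of samples in one fixed-length unvoiced segment."""
    return int(round(duration_s * fs))


# Hypothetical GCI positions (in samples) for three consecutive voiced periods.
periods, f0 = pitch_contour_from_gci([0, 100, 200, 280])
print(periods)                   # periods of 10 ms, 10 ms and 8 ms
print(f0)                        # 100 Hz, 100 Hz and 125 Hz
print(unvoiced_frame_length())   # 50
```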

Another important factor affecting the excitation waveform is the excitation gain,
which is a function that controls the power (amplitude) of the excitation waveform. For
most excitation source models, the excitation amplitude is normalized and modulated by
the gain factor. This parameter is closely related to the intensity or sound pressure and is
an important factor affecting the speech synthesis quality. The gain is a record of the
average speech energy of each pitch period. In synthesis the gain scales the normalized
excitation waveform produced by the source model.

Figure 2-3. The relationship between the excitation control model and the excitation
source model.

2.1.2  Resonant  Tract 

The  transfer  function  of  the  network  is  defined  as  the  ratio  of  the  Laplace 
transform  of  the  sound  pressure  from  the  lips  of  the  speaker  to  the  volume  velocity  of  the 
air  flow  passing  through  the  vocal  folds.  Based  on  the  source-filter  theory,  there  are  three 
types  of  implementations  for  the  transfer  function,  namely,  the  articulatory  configuration, 
the  formant  configuration,  and  the  LP  configuration.  Since  a  high-quality  speech 
synthesizer  is  not  available  for  the  articulatory  configuration  at  this  time,  we  limit  our 
research  to  the  formant  and  the  LP  configurations. 

Both  configurations  can  be  realized  by  an  all-pole  system  to  simulate  the 
resonances  (formants)  of  the  speech,  which  has  been  shown  to  be  a  good  representation  of 
the  vocal  tract  for  a  majority  of  speech  sounds,  including  voiced  and  unvoiced  phonations 
(Atal  and  Hanauer,  1971;  Lalwani  and  Childers,  1991).  Note  that  only  one  resonant  filter 
is  employed  in  our  synthesis  system  instead  of  multiple  branches. 

2.1.2.1  Formant  configuration 

Based  on  the  perceptual  characteristics  of  the  human  auditory  system,  a  formant 
synthesizer  is  developed.  The  general  configuration  of  such  a  synthesizer  is  shown  in 
Figure  2-4.  The  transfer  function  acts  as  a  filter  with  various  resonances,  called 
formants,  to  shape  the  spectral  characteristics  of  speech.  Normally,  for  a  speech  sampling 
frequency  of  10  kHz,  the  first  five  formants  are  adequate  to  represent  the  vocal  tract  for 
voiced  sounds  (Klatt  and  Klatt,  1990).  That  is,  the  information  contained  in  five  formant 
frequencies  and  five  formant  bandwidths  can  be  used  to  construct  a  filter  to  shape  the 
spectrum of the glottal waveform to produce a synthetic speech signal.

Figure 2-4. Block diagram of a cascade formant synthesizer.

For a pole z_i with angle φ_i and radius r_i in the z-domain, its transfer function is
given by

$$H(z) = \frac{1}{1 - r_i e^{j\phi_i} z^{-1}} \tag{2-1a}$$

and its power spectrum is given by

$$|H(e^{j\theta})|^2 = \frac{1}{1 - 2 r_i \cos(\theta - \phi_i) + r_i^2} \tag{2-1b}$$

If the sampling frequency is F_s, the corresponding resonant frequency and bandwidth are
defined as follows

$$\text{Formant Frequency} = \frac{\phi_i}{2\pi}\, F_s \tag{2-2a}$$

$$\text{Formant Bandwidth} = \frac{1}{\pi} \cos^{-1}\!\left(\frac{4 r_i - 1 - r_i^2}{2 r_i}\right) F_s \tag{2-2b}$$

The bandwidth is calculated by finding the frequency where the spectral energy is 3 dB
below the peak. In practice the poles are complex-conjugate-paired to form an
autoregressive filter with real coefficients. Therefore the transfer function, V_i(z), for the
ith formant can be written as

$$V_i(z) = \frac{1}{(1 - r_i e^{j\phi_i} z^{-1})(1 - r_i e^{-j\phi_i} z^{-1})} \tag{2-3}$$

By cascading 5 formants, a 10th order transfer function, V(z), can be formulated to
describe the vocal tract for voiced sounds.

$$V(z) = \prod_{i=1}^{5} V_i(z) = \frac{1}{1 - \sum_{k=1}^{10} a_k z^{-k}} = \frac{1}{A(z)} \tag{2-4}$$

where A(z) is the inverse of V(z).
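
The 3-dB bandwidth expression can be checked with a short derivation. The sketch below assumes only that the single-pole power spectrum is proportional to the reciprocal of D(θ) = 1 - 2 r_i cos(θ - φ_i) + r_i^2, so the result does not depend on any constant gain factor. The symbol Δ (half of the 3-dB width, in radians per sample) is introduced here for convenience.

```latex
\begin{align*}
|H(e^{j\theta})|^2 &\propto \frac{1}{D(\theta)}, \qquad
D(\theta) = 1 - 2 r_i \cos(\theta - \phi_i) + r_i^2, \\
D(\phi_i) &= (1 - r_i)^2 \quad \text{(spectral peak)}, \\
D(\phi_i \pm \Delta) &= 2\,(1 - r_i)^2 \quad \text{(half power, i.e. 3 dB below the peak)}, \\
\Rightarrow\ \cos\Delta &= \frac{1 + r_i^2 - 2\,(1 - r_i)^2}{2 r_i}
  = \frac{4 r_i - 1 - r_i^2}{2 r_i}, \\
\text{Bandwidth} &= 2\Delta \cdot \frac{F_s}{2\pi}
  = \frac{F_s}{\pi} \cos^{-1}\!\left(\frac{4 r_i - 1 - r_i^2}{2 r_i}\right).
\end{align*}
```

Since cos Δ = 1 - (1 - r_i)^2 / (2 r_i), a pole close to the unit circle gives Δ ≈ (1 - r_i) / sqrt(r_i), so the bandwidth is approximately (1 - r_i) F_s / π. This agrees closely with the commonly used approximation -ln(r_i) F_s / π for narrow resonances.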
An advantage of formant synthesis is that the relationship between the formants
and  the  vocal  tract  configuration  for  vowels  is  well  understood.  This  configuration  offers 
the  possibility  of  synthesizing  a  new  utterance  from  theoretical  parameters  (Klatt,  1980, 
1990). 
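
To show how formant frequencies and bandwidths translate into filter coefficients, the sketch below inverts the frequency and 3-dB bandwidth relations for a single pole and then cascades conjugate pole pairs into one denominator polynomial. The function names and the five formant values are assumptions for illustration, and the bandwidth relation used is the single-pole 3-dB form, cos Δ = (4r - 1 - r^2) / (2r) with Δ = π B / F_s.

```python
import numpy as np


def formant_to_pole(freq_hz, bw_hz, fs=10000.0):
    """Return pole radius r and angle phi for a formant frequency and bandwidth."""
    phi = 2.0 * np.pi * freq_hz / fs
    delta = np.pi * bw_hz / fs              # half of the 3-dB width, in radians
    # cos(delta) = (4r - 1 - r^2) / (2r)  <=>  r^2 - 2(2 - cos(delta)) r + 1 = 0
    b = 2.0 - np.cos(delta)
    r = b - np.sqrt(b * b - 1.0)            # the root inside the unit circle
    return r, phi


def pole_to_formant(r, phi, fs=10000.0):
    """Return formant frequency and 3-dB bandwidth in Hz for a pole (r, phi)."""
    freq_hz = phi / (2.0 * np.pi) * fs
    bw_hz = np.arccos((4.0 * r - 1.0 - r * r) / (2.0 * r)) / np.pi * fs
    return freq_hz, bw_hz


def cascade_denominator(freqs_hz, bws_hz, fs=10000.0):
    """Coefficients [1, c_1, ..., c_2M] of the denominator for M cascaded formants."""
    coeffs = np.array([1.0])
    for f, bw in zip(freqs_hz, bws_hz):
        r, phi = formant_to_pole(f, bw, fs)
        # (1 - r e^{j phi} z^-1)(1 - r e^{-j phi} z^-1) = 1 - 2 r cos(phi) z^-1 + r^2 z^-2
        section = np.array([1.0, -2.0 * r * np.cos(phi), r * r])
        coeffs = np.convolve(coeffs, section)
    return coeffs


# Assumed formant values for a neutral vowel sampled at 10 kHz.
freqs = [500.0, 1500.0, 2500.0, 3500.0, 4500.0]
bws = [60.0, 90.0, 120.0, 150.0, 200.0]
denominator = cascade_denominator(freqs, bws)
print(len(denominator))   # 11 coefficients, i.e. a 10th order filter
```

If the denominator is written as A(z) = 1 - sum_k a_k z^-k, the predictor-style coefficients are a_k = -c_k for k = 1, ..., 10.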

2.1.2.2  Linear  prediction  configuration 

The linear prediction model regards the speech signal as an autoregressive process
and is used to estimate the poles of a signal. The basic idea behind this process is that a
signal sample, s(n), can be estimated by a linear combination of its past samples, s(n-1),
s(n-2), ..., s(n-p), i.e.,

$$\hat{s}(n) = \sum_{k=1}^{p} a_k\, s(n-k), \tag{2-5}$$

where $\hat{s}(n)$ is the estimated signal at instant n, p is the number of past samples used to
predict the next sample, and the $a_k$ are the linear predictive coefficients for each past
sample. The LP coefficients are determined by minimizing the total error, E, which is the
sum of the squared differences, e(n), for N sequential samples:

$$E = \sum_{n=1}^{N} e^2(n), \tag{2-6}$$

$$e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} a_k\, s(n-k). \tag{2-7}$$

Furthermore, Eq. (2-7) can be transformed to the z-domain as

$$S(z) = \frac{E(z)}{A(z)} = E(z)\,V(z), \tag{2-8}$$

where

$$A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k}, \tag{2-9}$$

$$V(z) = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}, \tag{2-10}$$

and  V(z)  is  the  all-pole  transfer  function  of  the  estimated  vocal  tract  filter.  A  typical  LP 
synthesizer  is  shown  in  Figure  2-5. 
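
As a minimal sketch of Eqs. (2-5) to (2-7), the Python code below estimates the LP coefficients
and computes the prediction residue. It uses the autocorrelation method with the Levinson-Durbin
recursion, which is a common textbook choice and an assumption of this example; it is not the
covariance method used for the analysis later in this chapter. All function names are
hypothetical.

```python
def autocorrelation(signal, max_lag):
    """Autocorrelation r(0..max_lag) of a finite signal."""
    n = len(signal)
    return [sum(signal[i] * signal[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]


def lp_coefficients(signal, order):
    """Return a_1..a_p such that s_hat(n) = sum_k a_k * s(n - k), as in Eq. (2-5)."""
    r = autocorrelation(signal, order)
    a = [0.0] * (order + 1)  # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        previous = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = previous[j] - k * previous[i - j]
        error *= 1.0 - k * k
    return a[1:]


def residue(signal, coeffs):
    """Prediction error e(n) of Eq. (2-7); samples before n = 0 are taken as zero."""
    return [s - sum(a * signal[n - k]
                    for k, a in enumerate(coeffs, start=1) if n - k >= 0)
            for n, s in enumerate(signal)]


# Impulse response of a known all-pole filter 1 / (1 - 1.3 z^-1 + 0.6 z^-2).
test_signal = [1.0, 1.3]
while len(test_signal) < 200:
    test_signal.append(1.3 * test_signal[-1] - 0.6 * test_signal[-2])

estimated = lp_coefficients(test_signal, 2)
error_signal = residue(test_signal, estimated)
```

For this synthetic signal the recursion recovers the two filter coefficients, and inverse
filtering with A(z) returns the single excitation impulse.
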

An  advantage  of  LP  synthesis  is  that  it  is  capable  of  reproducing  any  speech  sound 
and  can  readily  estimate  the  synthesis  parameters;  however,  traditional  LP  synthesizers 
can produce artifacts (pop and click sounds) if the modelling process used for the
excitation  source  is  inadequate.  Recent  research  has  shown  that  various  LP  synthesizers, 
such  as  the  code  excited  linear  predictive  (CELP)  synthesizer  (Atal  and  Remde,  1982; 
Singhal  and  Atal,  1989)  and  the  glottal  excited  linear  predictive  (GELP)  synthesizer 
(Childers  and  Hu,  1994),  can  result  in  high-quality  synthetic  speech  if  the  excitation 
function  is  represented  by  an  adequate  number  of  codewords  or  pulses. 

2.1.2.3  Comments  on  these  two  resonant  configurations 

The AR filter in an LP configuration preserves the spectral envelope of the speech
signal. It contains the resonances of the vocal tract as well as the spectral aspects of the
glottal source, the radiation impedance, aspiration noise and a mathematical balance term
(Olive, 1992). This scheme, however, is not suitable for a physiological study of
human speech production, since the resonant tract parameters (LP coefficients) show little
relation to the anatomy and physiology of speech production.

On  the  other  hand,  it  is  difficult  to  estimate  the  resonant  tract  parameters  (formant 
frequencies  and  bandwidths)  of  a  formant  configuration,  especially  for  unvoiced  sounds. 
Therefore,  we  include  both  synthesis  schemes  in  our  system  to  investigate  the  factors  that 
influence  the  quality  of  synthetic  speech. 

Figure 2-5. Block diagram of a typical LP synthesizer.

2.1.3 Excitation Source model

There  are  two  types  of  excitation  sources  in  our  system.  One  is  voiced,  which 
involves  quasi-periodic  vibrations  of  the  vocal  folds  and  the  resulting  waveform  looks 
like  a  quasi-periodic  pulse  train.  The  other  is  unvoiced,  which  involves  the  generation  of 
turbulence noise by rapid flow of air past a narrow constriction and the waveform looks
like  a  random  noise  signal. 

2.1.3.1  Random  noise  model:  unvoiced 

Since our research is focused on converting voiced sounds, a stochastic codebook
developed by Hu (1993) is used in our system for unvoiced sounds. The codebook
contains 256 codewords, which code the residue for a 5 msec duration (50 samples at 10
kHz). Samples for each codeword are drawn from a Gaussian noise generator; however,
they are created under the following three schemes:

1. (64 entries) - Each codeword contains 16 non-zero samples. The positions of the
non-zero samples are uniformly distributed from 1 to 50.

2. (64 entries) - Each codeword contains 32 non-zero samples. The positions of the
non-zero samples are uniformly distributed from 1 to 50.

3. (128 entries) - Every sample is taken from a Gaussian noise generator.
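
The three schemes above can be sketched in Python as follows. This is only an illustration of
the codebook layout: the unit-variance Gaussian, the fixed random seed, the zero-based sample
positions and the function names are assumptions of this sketch, not details taken from
Hu (1993).

```python
import random


def make_codeword(length, nonzero, rng):
    """One codeword with `nonzero` Gaussian samples at uniformly chosen positions."""
    word = [0.0] * length
    for position in rng.sample(range(length), nonzero):
        word[position] = rng.gauss(0.0, 1.0)
    return word


def make_codebook(seed=0, length=50):
    """256 codewords: 64 with 16 pulses, 64 with 32 pulses, 128 fully Gaussian."""
    rng = random.Random(seed)
    codebook = []
    for entries, nonzero in ((64, 16), (64, 32), (128, length)):
        for _ in range(entries):
            codebook.append(make_codeword(length, nonzero, rng))
    return codebook


codebook = make_codebook(seed=1)
```

Each codeword covers 50 samples, that is, 5 ms at a 10 kHz sampling rate.
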

2.1.3.2 Excitation waveform model: voiced

The shape of the glottal excitation pulse has been shown to greatly affect the
quality  and  naturalness  of  synthetic  speech  (Rosenberg,  1971;  Naik,  1984).  A  successful 
source  model  should  not  only  capture  the  necessary  features  of  the  glottal  pulse  for 
synthesizing  high  quality  speech  but  also  have  the  capability  of  representing  the 
differences  between  various  speakers.  From  a  physiological  point  of  view,  voiced 
excitation  is  generated  by  the  vibration  of  vocal  folds,  and  a  good  source  model  should 
produce a glottal waveform similar to that of the human glottal apparatus. A major
problem with this approach is that it is difficult to measure the glottal
volume-velocity directly, since the vocal folds are located below the pharynx.

On  the  other  hand,  from  a  mathematical  point  of  view,  the  ideal  excitation  source 
is the residue signal obtained by inverse filtering the original speech signal. Direct
waveform coding of the residue requires a large bandwidth and is therefore not a practical
approach. One simple excitation model for speech synthesizers is to use an impulse train
for  voiced  sounds  and  random  noise  for  unvoiced  sounds  (Atal  and  Hanauer,  1971). 
However,  the  synthetic  speech  from  this  simple  model  is  judged  unnatural  and  not  capable 
of  carrying  information  about  speakers'  vocal  characteristics. 

Over  the  years,  various  excitation  models  have  been  proposed.  Attempts  have 
been  made  to  use  more  realistic  waveforms  instead  of  an  impulse  train  as  the  excitation 
source  for  generating  high  quality  speech  (Rosenberg  et  al.,  1975;  Fant  et  al.,  1985; 
Milenkovic,  1993;  Childers  and  Hu,  1995).  The  polynomial  model  and  the  LF  model  are 
two  such  examples. 

In order to facilitate the subsequent discussion, we adopt the LF model (Fant et al.,
1985) as an explanatory medium to illustrate the characteristics of a glottal waveform. The
differentiated glottal waveform given by the LF model is described as

$$E(t) = \begin{cases} E_0\, e^{\alpha t} \sin \omega_g t, & 0 \le t \le t_e \\[4pt] -\dfrac{E_0}{\varepsilon t_a}\left[e^{-\varepsilon (t - t_e)} - e^{-\varepsilon (t_c - t_e)}\right], & t_e < t \le t_c \end{cases} \tag{2-11}$$

where $t_p$, $t_e$ and $t_c$ are the instants indicating the maximum glottal flow, the maximum glottal
closing rate and complete glottal closure, respectively. The parameter $t_a$ is used to control
the abruptness of the return phase, and the parameter $\omega_g$, defined as $\pi/t_p$, determines the
frequency of the sinusoid. The parameters $E_0$, $\alpha$ and $\varepsilon$ are for computational use only. A
typical glottal flow and its differentiated glottal flow are shown in Figure 2-6.

Figure 2-6. A typical glottal flow (solid line) and its differentiated
glottal waveform (dashed line).


Note  that  the  LF  parameters  (tp,  te,  tc,  ta)  specify  the  glottal  flow  as  well  as  the 
crucial  timing  information  about  the  vocal  folds  vibrating.  The  first  segment  of  the  LF 
model  characterizes  the  differential  glottal  flow  over  the  interval  from  the  glottal  opening 
to  the  instant  of  glottal  closure.  The  second  segment  represents  a  residual  glottal  flow  that 
comes  after  the  first  instant  of  closure  until  complete  closure.  With  this  brief  information 
about  the  characteristics  of  the  glottal  flow,  we  develop  our  excitation  waveform  models 
as  follows. 

2.1.3.2.1  Polynomial  model 

Polynomial  fitting  is  one  way  to  find  an  approximate  low  frequency  waveform  for 
the  differentiated  glottal  volume-velocity  (Childers  and  Hu,  1994;  Shue,  1995).  The 
polynomial  function  can  be  written  as 

$$p(t) = C_0 + C_1\frac{t}{T} + C_2\left(\frac{t}{T}\right)^2 + C_3\left(\frac{t}{T}\right)^3 + C_4\left(\frac{t}{T}\right)^4 + C_5\left(\frac{t}{T}\right)^5 + C_6\left(\frac{t}{T}\right)^6, \quad 0 \le t \le T, \tag{2-12}$$

where t is the time variable and T is the pitch period. The polynomial coefficients, $C_0$, $C_1$, $C_2$,
$C_3$, $C_4$, $C_5$, $C_6$, are determined by fitting the polynomial function, p(t), to the estimated
differentiated glottal waveform in a least-squares sense. The order of the polynomial is
chosen to be six (Hu, 1993; Shue, 1995).

The  derivation  of  the  polynomial  coefficients  is  an  optimization  problem,  which  is 
usually  known  as  a  polynomial  fitting  algorithm.  The  fitting  algorithm  itself  is  not  the 
main  concern  of  this  research;  we  are  interested  in  how  this  model  describes  the  glottal 
flow. In other words, Eq. (2-12) within the interval [0,T] must meet the phase
characteristics of a differentiated glottal flow:

1. The local minima of the waveform are located at p(0) and p(T).

2. $\int_0^T p(t)\,dt = 0$. This is to ensure zero gain of the flow over a pitch period.

Here the interval boundaries, 0 and T, correspond to the glottal closure instants (GCI).
The first constraint states that the abrupt flow termination occurs at the GCIs.

To acquire the polynomial coefficients under these constraints, we can introduce
Lagrange multipliers and solve a set of equations as in an optimal control system.
Nonetheless, the main purpose of these constraints is not to limit the dynamics of the
polynomial coefficients while carrying out the optimization. Instead, the constraints are
just used to regulate the polynomial waveform. Referring to Figure 2-6, the
differentiated glottal waveform varies around the glottal closure instant. Therefore, we
design the following pre-process to emphasize the waveform around the glottal closure
instants (GCI).

1. Set the value of the starting point of the estimated waveform to zero.

2.  Normalize  the  amplitude  of  the  waveform  by  dividing  the  signal  by  the  largest 
positive  amplitude. 

3.  Apply  a  weighting  function  to  magnify  the  slope  of  the  waveform  around  the 
GCIs,  while  maintaining  the  other  portion  of  the  waveform. 

Note that since the waveform's slope on the left side of the GCI is usually smaller
than the one on the right side (refer to Figure 2-6), the weighting is larger in the
beginning region than in the ending region. The weighting function is designed as

$$W(t) = \begin{cases} 200t^2 - 40t + 3 & \text{for } 0 \le t < 0.1 \quad [1] \\ 1 & \text{for } 0.1 \le t \le 0.8 \quad [2] \\ 25t^2 - 40t + 17 & \text{for } 0.8 < t \le 1 \quad [3] \end{cases} \tag{2-13}$$

The weighting function is displayed in Figure 2-7. The application of this
weighting function does not change the amplitude of the waveform since its maximum
positive value is usually located in region [2] and the minimum value is set to zero.
However, the slope of the waveform in regions [1] and [3] is emphasized, which is in
accord with the first constraint. In practice, the weighting function can also reduce the
chance of rank deficiency while we perform the polynomial fit.
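
Equation (2-13) can be written directly as a piecewise function. The sketch below assumes that
t is expressed as a fraction of the pitch period (0 to 1), which is our reading of the interval
bounds; the function name is hypothetical.

```python
def glottal_weight(x):
    """Weighting function W of Eq. (2-13) for normalized time x in [0, 1]."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("normalized time must lie in [0, 1]")
    if x < 0.1:                      # region [1]: start of the period
        return 200.0 * x * x - 40.0 * x + 3.0
    if x <= 0.8:                     # region [2]: waveform left unchanged
        return 1.0
    return 25.0 * x * x - 40.0 * x + 17.0   # region [3]: end of the period


weights = [glottal_weight(i / 100.0) for i in range(101)]
```

The three pieces join continuously at 0.1 and 0.8, and the weight rises to 3 at the start of the
period and to 2 at its end.
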

Now we are ready to find a tentative polynomial by least-squares fitting
Equation (2-12) to the weighted, normalized waveform. The second constraint can be
satisfied by adjusting the tentative polynomial as follows. The constant $C_0$ is modified to
meet this requirement by

$$C_0 = -\sum_{i=1}^{6} \frac{C_i}{i+1}, \tag{2-14}$$

which only changes the D.C. level of the waveform. Consequently, the resulting integral
for a pitch period becomes

$$\int_0^T p(t)\,dt = T \sum_{i=0}^{6} \frac{C_i}{i+1} = 0. \tag{2-15}$$

An example of a waveform obtained by this model is shown in Figure 2-8 for both
the glottal flow and its derivative.
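
The fit of Eq. (2-12) and the D.C. adjustment of Eq. (2-14) can be sketched as below. The code
assumes that the input samples are already weighted and normalized and are spaced uniformly over
one pitch period; solving the normal equations by Gaussian elimination is a choice made for this
example, not necessarily the fitting algorithm used in the original system. All names are
hypothetical.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    rows = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n + 1):
                rows[r][c] -= factor * rows[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(rows[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (rows[i][n] - tail) / rows[i][i]
    return x


def fit_polynomial(samples, order=6):
    """Least-squares coefficients C_0..C_order in the variable t/T."""
    n = len(samples)
    xs = [i / (n - 1) for i in range(n)]
    gram = [[sum(x ** (i + j) for x in xs) for j in range(order + 1)]
            for i in range(order + 1)]
    moments = [sum(y * x ** i for x, y in zip(xs, samples))
               for i in range(order + 1)]
    return solve_linear(gram, moments)


def enforce_zero_integral(coeffs):
    """Replace C_0 as in Eq. (2-14) so the integral over one period is zero."""
    adjusted = list(coeffs)
    adjusted[0] = -sum(c / (i + 1) for i, c in enumerate(adjusted) if i > 0)
    return adjusted


def evaluate(coeffs, x):
    """Evaluate the polynomial at normalized time x = t/T."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


# Synthetic stand-in for one period of a differentiated glottal waveform.
samples = [(i / 49.0 - 0.5) ** 2 for i in range(50)]
tentative = fit_polynomial(samples)
final = enforce_zero_integral(tentative)
```

Only $C_0$ differs between the tentative and the adjusted coefficient sets, so the waveform shape
is preserved and only its D.C. level moves.
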

2.1.3.2.2 Transformed LF model

Although the polynomial glottal source model is robust, the lack of correlation
between the model parameters and physiology is troublesome. As illustrated in the previous
section, the LF model parameters are closely related to the acoustic features of the
glottal volume velocity; however, finding the LF model parameters is not an easy task.
Procedures have been proposed to fit the estimated glottal waveform from glottal inverse
filtering with the LF model timing parameters (Childers and Ahn, 1994). A major problem
is labeling the glottal opening instant. Such information can be obtained from an
auxiliary EGG signal. We selected a simplified version of the LF model (Fant, 1995),
which does not need to estimate the glottal opening instant, as a supplement to the
polynomial source model.

Figure 2-7. Display of the weighting function, W(t).

Figure 2-8. (a) The differentiated glottal waveform (solid line) and
its polynomial model waveform (dashed line). (b) The
glottal volume velocity (solid line) and its polynomial
model waveform (dashed line).

The transformed LF model, as illustrated in Figure 2-9, has only one waveshape
parameter, Rd, to control the phase characteristics of the waveform. The normal
covariation among the LF parameters and the basic waveshape parameter permits the
prediction of the default LF parameters from Rd, which is a unique function of U0, Ee and
F0 and can be derived from these parameters using inverse filtering results. Namely, once
Rd is obtained, the LF parameters (tp, te, tc, ta) can be derived accordingly.

The main importance of the Rd parameter is that it is the most effective single
measure for describing voice quality and simplifies the description of text-to-speech
source rules. The main range is 0.4 > Rd > 0.27. For a more detailed description of
this parameter refer to Fant (1995). Figure 2-10 shows an example of fitting the
differential glottal waveform with this model.

2.2  Speech  Synthesizer 

Our  research  focus  is  on  developing  a  flexible  voice  conversion  system  and 
deriving  the  rules  for  converting  the  speech  parameters  from  one  acoustic  space  to 
another.  Therefore  we  can  develop  our  system  based  on  an  existing  speech  synthesizer 
and  integrate  it  with  our  parameter  modifier.  We  select  two  successful  speech 
synthesizers, a glottal excited linear predictive (GELP) speech synthesizer (Childers and Hu,
1995) and a formant-based linear predictive (FBLP) speech synthesizer (Shue, 1995) as
our  basic  speech  analysis/synthesis  system.  Both  systems  were  developed  in  our 
laboratory  and  have  proven  to  produce  high-quality  synthetic  speech  for  several  voice 
types. 

Figure 2-9. The LF model extended to include the Rd parameter: (a) the glottal
waveform; (b) the differentiated glottal waveform.

Figure 2-10. The differentiated glottal waveform and its transformed
LF model: (a) the differentiated glottal waveform; (b) the glottal volume velocity.

2.2.1 Glottal Excited Linear Prediction Synthesizer

The GELP synthesizer, which synthesizes speech with high quality, provides the
speech scientist with a simple speech synthesis procedure that uses established analysis
techniques, is able to reproduce all speech sounds, and has an excitation
model waveform that is related to the derivative of the glottal flow and the integral of the
residue (Childers and Hu, 1995). In other words, using a 6th order polynomial waveform
to model the glottal excitation makes the GELP synthesizer different from traditional LP
synthesizers, and provides ways to measure aspects of the glottal waveform. In
addition, this source model enhances the quality of synthetic speech. If the recorded
speech is played back via loudspeakers in an A-B test, listeners find it difficult to
discriminate the synthetic speech from its original counterpart (Hu, 1993). The system is
illustrated in Figure 2-11.

2.2.2  Formant  Based  Linear  Prediction  Synthesizer 

The  formant-based  linear  prediction  (FBLP)  synthesizer  is  a  hybrid  system  that 
uses  the  formant  synthesis  scheme  to  produce  voiced  sounds  and  the  LP  synthesis  scheme 
to  generate  unvoiced  sounds  (Shue,  1995).  The  system  is  illustrated  in  Figure  2-12.  The 
vocal  tract  is  characterized  by  five  formants  (tenth  order  polynomial)  for  voiced  sounds 
and  thirteenth  order  linear  prediction  coefficients  for  unvoiced  sounds.  Depending  upon 
the  classification  of  voiced/unvoiced  sound,  one  of  two  categories  of  speech  synthesis  is 
used.  Once  the  segment  of  speech  is  classified  as  voiced,  the  formant  estimation  process 
assigns  the  appropriate  roots  to  simulate  the  vocal  tract.  Otherwise,  the  LP  coefficients 
will  be  used  to  represent  this  segment  of  speech. 

2.2.3  The  Proposed  Synthesizer  for  Voice  Conversion 

In the above discussion, we found that either the LP speech synthesizer or the
formant speech synthesizer can reproduce high-quality synthetic speech, and each scheme
has its drawbacks. The speech analysis/synthesis system we implemented here is a hybrid
system: the user can choose either a formant or an LP configuration for voiced sounds,
while the LP representation is used for unvoiced sounds. For the voiced excitation source,
we use either the polynomial model or the transformed LF model. The speech synthesis
system is illustrated in Figure 2-13. One important feature of our system is that we
introduce the control model, including the voiced/unvoiced classification, the pitch
contour and the gain contour, to control the system operation. This feature enables us to
mimic the "speaking style" used by different speakers.

Figure 2-11. The block diagram of the GELP synthesizer.

Figure 2-12. The block diagram of the formant based LP synthesizer.

2.3  Speech  Analysis  and  Synthesis 

The  speech  signal  can  be  considered  as  a  quasi-stationary  event,  that  is,  the 
properties  of  the  speech  signal  change  relatively  slowly  with  time.  It  is  possible  to  use  a 
set  of  parameters  to  specify  a  short  segment  (frame)  of  the  speech  signal.  Often  these 
short  segments,  which  are  sometimes  called  analysis  frames,  overlap  one  another.  On  the 
output,  the  speech  signal  is  created  by  concatenating  sequential  frames  for  synthesis  with 
an  overlap-and-add  method.  The  speech  parameters  of  our  synthesizer  are  obtained  by 
the  analysis-by-synthesis  procedure,  in  which  the  analysis  denotes  the  process  of 
estimating  the  parameters  that  characterize  the  speech  signal  and  the  synthesis  denotes  the 
process  of  replicating  the  speech  signal  by  controlling  and  updating  these  parameters 
under  the  supervision  of  the  speech  production  model. 

2.3.1  Speech  Parameters 

Before  we  discuss  the  analysis  schemes  for  voice  conversion,  the  speech 
parameters  are  introduced  in  this  section.  Based  on  the  proposed  synthesizer,  the  speech 
parameters  are  grouped  into  three  categories:  1)  the  excitation  control  parameters,  2)  the 
excitation  source  parameters,  and  3)  the  resonant  tract  parameters.  During  the  synthesis 
process,  the  last  two  groups  of  parameters  are  updated  at  the  beginning  of  every  pitch 
period,  while  the  first  one  controls  the  updating  rate.  The  relationship  between  these 
parameters  and  our  speech  synthesizer  is  illustrated  in  Figure  2-13. 


Figure 2-13. The proposed hybrid synthesizer for voice conversion.
(a) Speech synthesizer for voiced sounds (blocks: pitch parameters, pitch period
modulator, source parameters, polynomial model, transformed LF model, gain parameters,
gain, resonant parameters, formant cascade configuration, LPC polynomial configuration).
(b) Speech synthesizer for unvoiced sounds (blocks: stochastic codebook, gain, LPC
polynomial configuration, synthetic speech).

These sets of parameters will be altered or modified independently in the
parameter modifier, as the key to simulating the voice modification/conversion process.
Note that these parameters are model-dependent. For example, the source parameter that
specifies the glottal waveform shape of the LF model cannot be transformed
to control the waveform shape of the polynomial model.

2.3.1.1  Control  parameters 

There  are  three  types  of  parameters  that  control  the  excitation  functions: 

1.  Voiced/unvoiced  classification,  Vc. 

The  voiced/unvoiced  classification  determines  which  synthesis  scheme  (voiced  or 
unvoiced )  is  adopted  for  each  synthesis  frame. 

2.  Gain  parameter,  g. 

Because the human auditory system is sensitive to the intensity of speech, a gain parameter
(either voiced or unvoiced) is needed to control the intensity of synthesized speech.

3.  Pitch  parameters,  Pp. 

The pitch (period) parameter, Pp, determines the length of
the glottal excitation waveform. For unvoiced sounds, the pitch period is fixed at 5 ms.

2.3.1.2  Excitation  parameters 

Either the parameters, C0, C1, C2, C3, C4, C5, C6, for the 6th order polynomial model,
or the parameter, Rd, for the transformed LF model, specify the shape of the glottal
waveform. Note that only one type of source model is employed in speech synthesis; the
user selects the model to be used.

2.3.1.3  Resonant  tract  parameters 

For synthesizing voiced sounds, the parameters, f1, f2, f3, f4, f5, b1, b2, b3, b4, b5,
determine the resonant frequencies and bandwidths in Hz for the first five resonators of
the vocal tract for the formant configuration, while the parameters, a1, a2, a3, a4, a5, a6, a7,
a8, a9, a10, a11, a12, a13, represent the LP coefficients for the LP configuration.

2.3.2  Speech  Analysis 

The  analysis  process  estimates  the  parameters  that  characterize  the  speech  signal 
under  the  supervision  of  the  speech  production  model.  Figure  2-14  shows  the  functional 
block  diagram  of  our  speech  analysis.  An  LP-based  analysis  scheme  is  employed  in  this 
research,  since  an  all-pole  system  is  used  to  represent  the  resonant  tract  in  our  synthesizer. 
The  LP  analysis  is  widely  accepted  as  the  basis  for  many  practical  speech  studies,  because 
most  of  the  speech  parameters  can  be  derived  via  LP  analysis  (Markel  and  Gray,  1976; 
Rabiner  and  Schafer,  1978). 

2.3.2.1  Frame-based  LP  analysis 

This  section  describes  the  initial  frame-based  LP  analysis  procedure,  illustrated  in 
Figure 2-15. The first block is to pre-process the speech signal. The speech signal is
normalized by dividing by the maximum amplitude, and segmented into 25 ms frames
with a 5 ms overlap. If the final frame is less than 25 ms, random noise (30 dB below the
peak  amplitude)  is  appended  to  the  frame  to  fill  it  out.  The  speech  signal  is  then  filtered 
by  a  zero-phase  filter,  H(z),  to  remove  the  low  frequency  drift.  This  filter  is  given  by 

The second block is the fixed-frame LP analysis. A linear predictor of 13th order
is chosen for our speech data (sampled at 10 kHz) and an orthogonal covariance method
(Ning and Whiting, 1990) is used to calculate the LP coefficients. The residue in the
overlapped area is obtained by weighting the forward and backward overlapping
sequences as

$$e(i) = \frac{N+1-i}{N+1}\, e_f(i) + \frac{i}{N+1}\, e_b(i), \quad i = 1, 2, \ldots, N, \tag{2-17}$$

where $e_f(i)$ and $e_b(i)$ denote the forward and backward residue signals, respectively, and e(i) is
the resulting residue signal for the overlapped area of length N.

Figure 2-14. Block diagram of our speech analysis.

Figure 2-15. The procedure for initial frame-based LP analysis.
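
The overlap weighting of Eq. (2-17) amounts to a cross-fade between the two residue sequences.
The sketch below assumes linear weights that sum to one, (N+1-i)/(N+1) for the forward residue
and i/(N+1) for the backward residue; this is our reading of the equation, and the function name
is hypothetical.

```python
def blend_overlap(forward, backward):
    """Cross-fade two residue segments of equal length N over the overlap region."""
    n = len(forward)
    if len(backward) != n:
        raise ValueError("forward and backward residues must have equal length")
    return [((n + 1 - i) * ef + i * eb) / (n + 1)
            for i, (ef, eb) in enumerate(zip(forward, backward), start=1)]


# The forward residue fades out while the backward residue fades in.
blended = blend_overlap([4.0, 4.0, 4.0], [0.0, 0.0, 0.0])
```

With 25 ms frames and a 5 ms overlap at 10 kHz, the overlap region would hold N = 50 samples.
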

The  third  block  is  to  classify  the  frame  as  voiced  or  unvoiced  using  the  residual 
signal.  Since  there  are  two  types  of  excitation  sources  in  the  synthesizer,  only  one  bit  of 
information  (V/U)  is  needed.  The  algorithm  for  determining  the  voicetype  is  simple.  If 
the  energy  of  the  underlying  signal  is  below  a  specified  value,  this  segment  of  signal  is 
classified  as  unvoiced.  Otherwise,  we  examine  its  spectral  tilt  by  calculating  the  first 
reflection  coefficient.  The  signal  is  voiced  if  the  first  reflection  coefficient  is  larger  than 
0.3  (Hu,  1993). 
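
The two-stage voicing decision can be sketched as follows. The energy threshold value is a
placeholder, since the text only refers to "a specified value", and the first reflection
coefficient is computed here as the normalized lag-one autocorrelation r(1)/r(0), a sign
convention we assume for this example. The function names are hypothetical.

```python
import math


def first_reflection_coefficient(frame):
    """Normalized lag-one autocorrelation r(1)/r(0); 0.0 for an all-zero frame."""
    r0 = sum(x * x for x in frame)
    if r0 == 0.0:
        return 0.0
    r1 = sum(frame[n] * frame[n - 1] for n in range(1, len(frame)))
    return r1 / r0


def classify_frame(frame, energy_threshold=1e-3, k1_threshold=0.3):
    """Return 'V' for voiced or 'U' for unvoiced."""
    energy = sum(x * x for x in frame) / len(frame)
    if energy < energy_threshold:          # too weak: unvoiced
        return "U"
    # Strong low-frequency tilt gives a large positive coefficient.
    return "V" if first_reflection_coefficient(frame) > k1_threshold else "U"


low_tone = [math.sin(2.0 * math.pi * 100.0 * n / 10000.0) for n in range(250)]
alternating = [(-1.0) ** n for n in range(250)]
silence = [0.0] * 250
```

A 100 Hz tone sampled at 10 kHz has a strong low-frequency tilt and is labeled voiced, while a
sample-to-sample alternating signal and silence are labeled unvoiced.
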

The  last  block  is  to  detect  the  pitch  period  and  the  glottal  closure  instants.  It  has 
long  been  noted  that  the  sharp  peaks  in  the  residue  signal  generally  coincide  with  the  GCIs 
for  a  wide  variety  of  voiced  sounds  (Atal  and  Hanauer,  1971).  The  detection  can  be  done 
by  picking  the  peaks  in  the  voiced  residue  signal.  However,  in  order  to  increase  the 
accuracy, the residue signal is first lowpass filtered by the filter

$$H(z) = \frac{1}{(1 - 0.9z^{-1})(1 - 0.7z^{-1})}, \tag{2-18}$$

and the pitch is estimated from the cepstrum of the residue signal. After finding the pitch
period, the most negative peak in the neighborhood of the pitch period in the smoothed
residue is picked as the glottal closure instant. With the GCI sequence, the speech signal
can be segmented pitch-synchronously.
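
A minimal sketch of cepstral pitch estimation followed by GCI picking is given below. The pitch
search range of 50 to 400 Hz, the 20 percent search tolerance around the expected GCI and the
function names are assumptions of this example. A direct DFT is used so that the code needs only
the standard library; a practical implementation would use an FFT.

```python
import cmath
import math


def dft(values, inverse=False):
    """Direct O(N^2) discrete Fourier transform (inverse includes the 1/N factor)."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    twiddle = [cmath.exp(sign * 2j * math.pi * k / n) for k in range(n)]
    out = [sum(v * twiddle[(q * i) % n] for i, v in enumerate(values))
           for q in range(n)]
    return [v / n for v in out] if inverse else out


def real_cepstrum(frame):
    """Inverse DFT of the log magnitude spectrum."""
    log_mag = [math.log(abs(v) + 1e-12) for v in dft(frame)]
    return [v.real for v in dft(log_mag, inverse=True)]


def pitch_period_from_cepstrum(frame, fs, f_min=50.0, f_max=400.0):
    """Pitch period in samples: the largest cepstral peak in the allowed range."""
    ceps = real_cepstrum(frame)
    low = int(fs / f_max)
    high = min(int(fs / f_min), len(ceps) // 2)
    return max(range(low, high + 1), key=lambda q: ceps[q])


def next_gci(residue, previous_gci, period, tolerance=0.2):
    """Index of the most negative sample near previous_gci + period."""
    centre = previous_gci + period
    half = max(1, int(tolerance * period))
    start = max(0, centre - half)
    stop = min(len(residue), centre + half + 1)
    return min(range(start, stop), key=lambda n: residue[n])


# Decaying pulse train with an 80-sample period (125 Hz at 10 kHz).
frame = [0.0] * 400
for m in range(5):
    frame[80 * m] = 0.8 ** m
period = pitch_period_from_cepstrum(frame, 10000.0)

# Toy smoothed residue with negative peaks near multiples of the period.
smoothed = [0.0] * 300
for index in (0, 82, 161):
    smoothed[index] = -1.0
gci = next_gci(smoothed, 0, period)
```

Repeating `next_gci` from each detected instant yields the GCI sequence used for
pitch-synchronous segmentation.
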

2.3.2.2 Resonant tract estimation

This  section  describes  the  procedure  for  estimating  the  resonant  tract.  Using  the 
pitch  period  and  the  GCI  sequence,  the  pitch  synchronous  LP  coefficients  can  be  obtained 
by  quadratic  interpolation  (Hu,  1993).  Note  that  a  stability  check  is  performed  for  the 
interpolated  LP  coefficients  since  this  interpolation  does  not  guarantee  a  stable  AR  filter. 
Namely,  the  roots  that  are  located  outside  of  the  unit  circle  are  reflected  inside  of  the  unit 
circle  (Oppenheim  and  Schafer,  1989). 
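
The stability check can be sketched as follows, assuming the roots of the interpolated LP
polynomial have already been found with a numerical root finder (not shown here). Each root
outside the unit circle is replaced by its reciprocal conjugate, which keeps its angle and
inverts its radius. The function names are hypothetical.

```python
import cmath


def reflect_unstable_roots(roots):
    """Move every root with |z| > 1 to 1/conj(z), which lies inside the unit circle."""
    return [1.0 / z.conjugate() if abs(z) > 1.0 else z for z in roots]


def polynomial_from_roots(roots):
    """Coefficients of prod(1 - z_k z^-1) in ascending powers of z^-1.

    The real part is returned, which is exact when complex roots come in
    conjugate pairs.
    """
    coeffs = [1.0 + 0.0j]
    for z in roots:
        # Multiply by (1 - z * z^-1): new[n] = old[n] - z * old[n - 1].
        coeffs = [c - z * p for c, p in zip(coeffs + [0.0], [0.0] + coeffs)]
    return [c.real for c in coeffs]


unstable = [cmath.rect(1.25, 0.3), cmath.rect(1.25, -0.3), 0.5 + 0.0j]
stable = reflect_unstable_roots(unstable)
stable_a = polynomial_from_roots(stable)
```

Reflection preserves the resonant frequency of each root, since the angle is unchanged, while
moving its radius from 1.25 to 0.8 in this example.
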

For  a  formant  configuration,  the  first  five  formants  have  to  be  estimated  from  the 
LP  polynomial.  Numerous  studies  have  estimated  the  formants  based  on  the  LP  spectra 
(Olive,  1971;  McCandless,  1974;  Childers  and  Lee,  1991).  One  method  of  estimating 
formants  is  to  factor  the  LP  polynomial  and  assign  the  appropriate  roots  to  simulate  the 
resonances of the vocal tract. For our analysis, a thirteenth-order LP polynomial provides
thirteen  roots.  Ten  of  these  roots  belong  to  the  vocal  tract  and  the  others  belong  to  the 
glottal  source  or  radiation  filter.  One  of  the  tasks  is  to  find  the  redundant  roots  for  each 
pitch  period. 

Based  on  the  synchronous  LP  coefficients,  a  formant  estimation  procedure  is 
developed  and  shown  in  Figure  2-16.  The  details  are  given  below. 

1.  Estimate  the  formants:  Solve  the  LP  polynomial  and  estimate  the  formant 
frequencies  and  bandwidths  of  the  roots  by  using  Equation  (2-2). 

2.  Delete  roots  with  formant  frequency  range  restriction:  Since  the  formants 
represent  the  resonances,  the  zero  frequency  roots  can  not  be  formants. 
Generally  speaking,  those  roots  with  ft-equency  less  200  Hz  or  larger  than  4700 
Hz  are  not  considered  valid  for  simulating  the  vocal  tract. 

3.  Delete  roots  with  bandwidth  restriction:  Those  roots  with  a  resonant  bandwidth 
larger  than  800  Hz,  or  a  bandwidth  to  frequency  ratio  larger  than  0.8  are  not  valid 
for  simulating  the  vocal  tract. 
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formant  root  estimation 
from  LP  polynomial 


remove  formant  roots  with: 

1 .  formant  frequency  range 

2.  formant  bandwidth  restriction 

3.  spurious  check 

4.  formant  number  restriction 


formant  allocation 

formant  interpolation 

formant  track  smoothing 

Figure  2-16.  The  procedure  to  estimate  the  formant  tracks. 
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4.  Delete  roots  with  spurious  check:  When  the  resonant  frequency  of  two  roots  are 
close  to  each  other  (less  than  200  Hz)  and  the  one  root  has  a  bandwidth  larger  than 
450  Hz,  the  root  with  larger  bandwidth  is  not  a  formant  candidate.  If  there  are 
more  than  3  roots  located  in  the  half-side  (either  less  or  larger  than  25(X)  Hz),  the 
one  with  largest  bandwith  is  considered  as  a  spurious  root. 

5. Delete roots with the formant number restriction: After steps 2, 3, and 4, for each
individual frame, if fewer than 6 roots remain, go to the next step. Otherwise,
delete the root with the largest bandwidth-to-frequency ratio.

6. Allocate formants: Once every frame has five or fewer formants, the formants are
allocated in ascending order of frequency. For frames with fewer than 5 formants,
the formants are allocated in accordance with the nearest five-formant frame.

7. Interpolate the formants: The vacant formant slots are filled by linear
interpolation of the neighboring formants. Note that an interpolated
formant cannot conflict with the ascending frequency order. If it does, delete all the
conflicting formants and go back to step 6.

8. Smooth the formant track: The formant track is constructed by concatenating the
sequential frames. However, a formant on the track should not deviate by more
than 10% from the neighboring frames on either side. If it does, the formant is considered
spurious and should be deleted. Go back to step 6 unless there are no empty
formant slots left.
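
The root-pruning rules above can be sketched in a few lines. This is only an illustration, not the dissertation's implementation: the function names are hypothetical, the frequency and bandwidth formulas are the standard pole-to-resonance mapping, and only the close-pair part of the spurious check (step 4) and the number restriction (step 5) are shown.

```python
import numpy as np

def roots_to_candidates(lp_coeffs, fs):
    """Convert the roots of an LP polynomial into (frequency, bandwidth) pairs in Hz."""
    roots = np.roots(lp_coeffs)
    roots = roots[np.imag(roots) > 0]               # keep one root per conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)      # pole angle -> resonant frequency
    bws = -np.log(np.abs(roots)) * fs / np.pi       # pole radius -> bandwidth
    order = np.argsort(freqs)
    return list(zip(freqs[order], bws[order]))

def prune_candidates(cands, max_formants=5, close_hz=200.0, wide_bw_hz=450.0):
    """Apply the close-pair spurious check, then the formant-number restriction."""
    cands = sorted(cands)
    changed = True
    while changed:
        changed = False
        for (f1, b1), (f2, b2) in zip(cands, cands[1:]):
            # Two roots closer than close_hz: drop the wider one if it is too wide.
            if f2 - f1 < close_hz and max(b1, b2) > wide_bw_hz:
                cands.remove((f1, b1) if b1 > b2 else (f2, b2))
                changed = True
                break
    # Too many roots left: drop the largest bandwidth-to-frequency ratio first.
    while len(cands) > max_formants:
        cands.remove(max(cands, key=lambda fb: fb[1] / fb[0]))
    return cands
```

Allocation, interpolation, and smoothing (steps 6 to 8) would then operate on the pruned candidates of consecutive frames.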

2.3.2.3  Excitation  waveform  estimation 

This section describes the procedure for estimating the excitation waveform.
Depending on the voicetype classification, there are two types of excitation waveform to
be estimated. For unvoiced sounds, the residue signal is used to find the optimal
codeword for our stochastic codebook. For voiced sounds, glottal inverse filtering is a
popular and efficient method for estimating the glottal waveform. It is based on the
assumptions that the source excitation and supraglottal loading are separable and that the
source properties of the speech production model can be uniquely determined. The
principle of inverse filtering is to obtain the glottal flow by eliminating the effects of
the vocal tract transfer function and lip radiation from the speech signal. Figure 2-17
presents the conceptual inverse filtering model.

The glottal inverse filter is constructed based on the estimated resonant tract
function. For a formant configuration, a 10th order inverse (formant) filter can be
obtained from the estimated 5 formant frequencies and bandwidths. In fact, the inverse
filter is the A(z) in Equation (2-4). The glottal inverse filtering is then formulated as

$$e(n) = \sum_{k=0}^{N-1} a(k)\, s(n-k) \tag{2-19}$$

where s(n) and e(n) are the speech signal and the differentiated glottal waveform,
respectively, and a(n) is the impulse response of the inverse filter.
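
A minimal sketch of Equation (2-19) for the formant configuration follows. It assumes the usual second-order resonator mapping (pole radius exp(-pi*B/fs), pole angle 2*pi*F/fs) for each formant; the function names and the sampling rate are illustrative choices, not taken from the source.

```python
import numpy as np

def formant_inverse_filter(formants, fs):
    """Build A(z) by multiplying one second-order section per (frequency, bandwidth) pair."""
    a = np.array([1.0])
    for freq, bw in formants:
        r = np.exp(-np.pi * bw / fs)          # pole radius from the bandwidth
        theta = 2 * np.pi * freq / fs         # pole angle from the frequency
        section = np.array([1.0, -2 * r * np.cos(theta), r * r])
        a = np.convolve(a, section)           # polynomial multiplication
    return a

def inverse_filter(speech, a):
    """Equation (2-19): e(n) = sum_k a(k) s(n-k), taking s(n) = 0 for n < 0."""
    return np.convolve(speech, a)[:len(speech)]
```

With five formants the returned a(k) has 11 coefficients, which is the 10th-order inverse filter mentioned above.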

For the LP configuration, the LP polynomial contains the spectral components of
the vocal tract, the lip radiation and the excitation source. Consequently, the residue
signal does not appear to be highly informative about the glottal source. However, the
integral of the residual tends to partially exhibit the shape of the differentiated glottal
waveform (Hu, 1993). That is, the differentiated glottal waveform is estimated by
integrating the residue signal. To support this claim, we perform the inverse filtering
on a synthetic vowel (produced by a formant synthesizer with the LF source model), as
shown in Figure 2-18, so that the similarities between the excitation waveform (by the LF
model) and the integrated residue can be seen.

[Figure 2-17. Block diagram of glottal inverse filtering. Signals shown: speech signal,
differentiated glottal waveform, glottal waveform. V(z): vocal tract transfer function;
R(z): lip radiation transfer function.]

[Figure 2-18. Illustration of the similarity between the differentiated glottal flow
and the integrated residue signal. Waveforms from top to bottom are: (1) excitation
waveform by LF model, (2) synthesized vowel /i/, (3) residual signal, and (4) integral
of the residual signal.]

[Figure 2-19. A procedure for estimating one glottal waveform with the polynomial
waveform model: (0) prepare the speech signal; (1) estimate the differentiated glottal
waveform by inverse filtering; (2) generate the glottal waveform with normalized
amplitude; (3) apply the weighting function to the glottal waveform; (4) fit the glottal
waveform with the polynomial model.]

Once the differentiated glottal waveform is obtained, deriving the parameters for
our excitation waveform models (the polynomial model and the transformed LF model) is
rather simple. As illustrated in Figure 2-19, we adopt the following steps to derive the
parameters for the excitation waveform model:

1. Estimate the differentiated glottal waveform by inverse-filtering the speech
signal. Segment the waveform at the glottal closure instants.

2. Normalize the amplitude of the integrated signal by setting the value of the
starting point to zero and dividing the signal by the largest amplitude.

3. Apply the weighting function to emphasize the polynomial fit around the
glottal closure instant (GCI).

4. Fit the signal with the polynomial waveform model in a least-squares sense, or
estimate uq, Ce from the glottal waveform for the transformed LF model.
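
Steps 2 to 4 for the polynomial model can be sketched as a weighted least-squares fit. The polynomial order, the shape of the weighting function, and the function name below are assumptions made for illustration; this section does not specify them.

```python
import numpy as np

def fit_glottal_pulse(diff_glottal, order=6, edge_weight=4.0):
    """Integrate, normalize, and fit one pitch period (hypothetical settings)."""
    pulse = np.cumsum(diff_glottal)              # integrate the differentiated waveform
    pulse = pulse - pulse[0]                     # starting point set to zero
    peak = np.max(np.abs(pulse))
    if peak > 0:
        pulse = pulse / peak                     # divide by the largest amplitude
    x = np.linspace(0.0, 1.0, len(pulse))        # normalized time within the period
    # Assumed weighting: grows toward the end of the segment, where the next
    # glottal closure instant lies after segmentation at the GCIs.
    weights = 1.0 + (edge_weight - 1.0) * x ** 4
    coeffs = np.polyfit(x, pulse, order, w=weights)
    return coeffs, pulse
```

The returned coefficients are in NumPy's highest-power-first order, so `np.polyval(coeffs, x)` regenerates the fitted pulse.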

2.3.2.4  Analysis  procedures  for  LP  synthesizer 

In  this  section  we  summarize  the  procedure  for  estimating  the  speech  parameters 
for  an  LP  synthesizer.  They  are: 

1. Pre-process the speech signal with a highpass filter. Segment the speech signal
into 250-sample frames with a 50-sample overlap. Append a random
noise signal to the last frame if its length is less than 250 samples.

2. Perform a fixed-frame LP analysis (covariance method) on each frame. Two
types of information are determined for each frame: 1) the LP coefficients, and
2) the "prediction error signal." Note that the residue is created by an
overlap-and-add method, since there are 50 overlapped samples between
adjacent analysis frames.

3. Classify each frame as a voiced frame or an unvoiced frame. Concatenate the
sequential voiced frames into a voiced region. Do the same for the unvoiced
frames.

4. Detect the pitch period and glottal closure instants for the voiced region.

5. Integrate and low-pass filter the residual signal to form the differentiated glottal
waveform for the voiced region. Segment the waveform at the GCIs into pitch
periods. Then estimate the parameters of the source model for each pitch period.

6. Interpolate the frame-based LP coefficients for each pitch period of the voiced
speech.

The  block  diagram  of  the  analysis  procedure  for  this  type  of  synthesizer  is 
depicted  in  Figure  2-20. 
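
The framing in step 1 and the overlap-and-add in step 2 can be sketched as follows. The noise level used for padding and the linear cross-fade over the 50 overlapped samples are assumptions; the source states only that noise is appended and that an overlap-and-add method is used.

```python
import numpy as np

def split_into_frames(signal, frame_len=250, overlap=50, noise_level=1e-4, seed=0):
    """Cut a signal into overlapping frames, padding the last one with low-level noise."""
    hop = frame_len - overlap
    n_frames = max(1, int(np.ceil((len(signal) - overlap) / hop)))
    padded_len = (n_frames - 1) * hop + frame_len
    rng = np.random.default_rng(seed)
    padding = noise_level * rng.standard_normal(padded_len - len(signal))
    padded = np.concatenate([signal, padding])
    return np.stack([padded[i * hop: i * hop + frame_len] for i in range(n_frames)])

def overlap_add(frames, overlap=50):
    """Rebuild one signal from per-frame signals, cross-fading linearly over the overlap."""
    n_frames, frame_len = frames.shape
    hop = frame_len - overlap
    out = np.zeros((n_frames - 1) * hop + frame_len)
    ramp = np.linspace(0.0, 1.0, overlap + 2)[1:-1]   # fade-in weights; mirrored fade-out sums to 1
    for i, frame in enumerate(frames):
        w = np.ones(frame_len)
        if i > 0:
            w[:overlap] = ramp                        # fade in over the leading overlap
        if i < n_frames - 1:
            w[-overlap:] = ramp[::-1]                 # fade out over the trailing overlap
        out[i * hop: i * hop + frame_len] += w * frame
    return out
```

In the analysis procedure the frames passed to `overlap_add` would be the per-frame prediction error signals rather than the speech frames themselves.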

2.3.2.5  Analysis  procedures  for  formant  synthesizer 

The procedure for obtaining the speech parameters for a formant-based
synthesizer is very similar to the one above. In addition to those steps, we have to
estimate the five formants for each frame and construct an inverse filter to estimate the
glottal waveform. To be specific, we use the following steps to get the speech parameters:

Steps 1-4 are the same as above.

5. Solve for the roots of the LP polynomial for each frame and estimate the first five
formants. Smooth the formant track for each voiced region.

6.  Construct  the  formant  filter  for  each  frame  and  inverse  filter  the  speech  signal 
with  this  filter.  The  resulting  signal  is  the  estimated  differential  glottal 
waveform.  Segment  the  waveform  and  estimate  the  source  parameters  for  each 
pitch  period. 

7.  Interpolate  the  frame-based  formant  information  for  each  pitch  period. 

The analysis procedures for the formant-based synthesizer are illustrated in Figure 2-21.

[Figure 2-20. Block diagram of the analysis procedures for an LP-based synthesizer.
Blocks: prepare speech signal for analysis; fixed-frame LP analysis (1. LP coefficients);
classify the V/U region; detect pitch period and GCIs (2. control parameters); estimate
glottal waveform (3. glottal source parameters).]

[Figure 2-21. Block diagram of the analysis procedures for a formant-based synthesizer.
Blocks: prepare speech signal for analysis; perform fixed-frame LP analysis; classify the
V/U region; detect pitch period and GCIs (1. control parameters); estimate formants;
perform glottal inverse filtering; interpolate the formants for each pitch period
(2. formant information); estimate glottal waveform (3. glottal source parameters).]

2.3.3 Synthesis Procedures

Speech synthesis is the process of reconstructing speech signals by controlling and
updating the parameters of a speech production model estimated in speech analysis. The
synthesis of unvoiced speech is straightforward and can be easily accomplished by
exciting the time-varying all-pole filter with the gain-adjusted innovation sequence
sequentially. On the other hand, the synthesis of voiced speech is rather complicated
because we have to generate the excitation waveform from the source parameters and the
updating process is controlled by the control parameters. Therefore, most of this section
is focused on the synthesis schemes for voiced speech.

2.3.3.1  Interpolation  of  the  glottal  phase 

In our synthesizer, only one excitation waveform is used to excite the resonant
tract transfer function for each pitch period. This can result in large discontinuities of the
glottal phase characteristics at frame boundaries. Since the glottal phase is a
manifestation of the source parameters, we apply an IIR filter to eliminate the
rapid changes in the source parameters as follows,

$$\hat{p}_i = 0.5\,\hat{p}_{i-1} + 0.5\,p_i \tag{2-20}$$

where $\hat{p}_i$ and $\hat{p}_{i-1}$ are the filtered source parameters for the ith and (i-1)th
pitch periods, respectively, and $p_i$ is the unfiltered source parameter for the current
pitch period. For the initial state, use the mean value of the source parameters in place
of $\hat{p}_{i-1}$.
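
Equation (2-20) is a one-pole recursive smoother; a minimal sketch is given below. The function name is hypothetical, and the input is assumed to hold one row (or one scalar) of source parameters per pitch period.

```python
import numpy as np

def smooth_source_parameters(params):
    """Apply p_hat[i] = 0.5 * p_hat[i-1] + 0.5 * p[i], seeded with the mean value."""
    params = np.asarray(params, dtype=float)
    smoothed = np.empty_like(params)
    prev = params.mean(axis=0)            # initial state: mean of the source parameters
    for i, p in enumerate(params):
        prev = 0.5 * prev + 0.5 * p
        smoothed[i] = prev
    return smoothed
```

For example, the sequence 0, 4, 8 has mean 4 and is smoothed to 2, 3, 5.5.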

2.3.3.2  Superposition  of  vocal  noise 

Vocal noise is important for the naturalness of synthesized speech, especially for
breathy and female voices (Klatt, 1987; Pinto et al., 1989). As Figure 2-6 and
Figure 2-9 show, the modeling process for the excitation waveform is in effect a
smoothing process. Namely, the source parameters contain only the modeling errors,
not the turbulent noise that is evident in the natural excitation waveform (Homes, 1976;
Titze et al., 1987; Eskenazi et al., 1990; Pinto and Titze, 1990). Therefore we add white
noise to the synthetic (voiced) excitation, with the amplitude of the noise adjusted to
achieve a signal-to-noise ratio (SNR) of 25 dB. The noise is produced by modulating
uniformly distributed white noise with a Gaussian window given by

$$[\,\text{Gaussian term in } a \text{ and } l\,] + 0.5, \qquad -l/2 \le n \le l/2 \tag{2-21}$$

where l is the pitch period. We choose a as 0.25 empirically. Figure 2-22 shows an
example of adding white noise to an excitation waveform.

[Figure 2-22. An example of adding white noise to excitation waveforms. (a) Excitation
waveforms without added white noise. (b) Excitation waveforms with added white noise.]

2.3.3.3 Gain adjustment

The gain parameter is a function that modulates the power of the excitation
waveform and it is an important factor affecting synthesis quality. Ideally, the power of
the synthesized speech should be equal to the power of the original speech. Since an
all-pole filter is used in our synthesizer, it is not straightforward to regulate the excitation
gain such that the filter output gain is equal to that of the original signal. In this section
we introduce our method of adjusting the excitation gain from the recorded gain
parameter.

For an all-pole filter, the filter output can be decomposed into two components:
one results from the input sequence, and the other results from the filter memory. For
simplicity, we could perform the filtering process with a zero initial state, i.e. no memory,
but this would result in a discontinuity when concatenating the output signal sequentially.
The general strategy for solving this kind of discontinuity problem is to use a
superposition method (Moulines and Charpentier, 1990; Hu, 1993). That is, for each
pitch period there are two synthesis filters employed; one holding the previous LP
coefficients accounts for the memory contribution, and the other possessing the new LP
coefficients is responsible for the current excitation. The synthesized speech is the
combination of these two outputs.

In our speech synthesis algorithm, we use only one (vocal tract) filter for each
pitch period, but the length of the excitation pulse is doubled by appending
zeros. Consequently, the output of the filter is twice as long as the pitch period. The first
part of the output is responsible for the current excitation, and the second part accounts for
the dying-out excitation, that is, the memory contribution for the next period. The
gain for this excitation is determined by subtracting the memory contribution from the total
power. In sum, the synthesized speech for this pitch period is the current filter output plus
the memory contribution from the previous excitation. The proposed strategy for gain
adjustment is illustrated in Figure 2-23.
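
The strategy described above can be sketched as follows. This is an illustration under stated assumptions rather than the author's code: the function names are hypothetical, the gain is obtained by subtracting the energy of the memory contribution from the target energy (which neglects any cross term between the two components), and a memory tail longer than the current period is simply truncated.

```python
import numpy as np

def all_pole_filter(a, x):
    """g(n) = sum_{i=1..p} a[i-1] * g(n-i) + x(n), starting from a zero initial state."""
    g = np.zeros(len(x))
    for n in range(len(x)):
        past = g[max(0, n - len(a)):n][::-1]      # g(n-1), g(n-2), ...
        g[n] = x[n] + np.dot(a[:len(past)], past)
    return g

def synthesize_period(excitation, a, memory, target_power):
    """Return (speech for this period, memory for the next period, excitation gain)."""
    m = len(excitation)
    # Append m zeros so the filter output is twice as long as the pitch period.
    g = all_pole_filter(a, np.concatenate([excitation, np.zeros(m)]))
    mem = np.zeros(m)
    k = min(m, len(memory))
    mem[:k] = memory[:k]
    # Energy left for the current excitation once the memory part is removed.
    available = max(m * target_power - np.dot(mem, mem), 0.0)
    energy = np.dot(g[:m], g[:m])
    gain = np.sqrt(available / energy) if energy > 0 else 0.0
    return gain * g[:m] + mem, gain * g[m:], gain
```

Calling `synthesize_period` period by period, and passing each returned memory into the next call, concatenates the output without resetting the filter state abruptly.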

The gain adjustment algorithm is formulated as follows. Given an
excitation signal x(n), the output of an all-pole filter, $g_x(n)$, can be written as

$$g_x(n) = \sum_{i=1}^{p} a_i\, g_x(n-i) + x(n), \qquad 0 \le n \le 2M-1 \tag{2-22}$$

where the $a_i$ are the filter coefficients, p is the filter order, and M is the length of the
current pitch period. Given a segment of original speech s(n) with M samples, the power
$P_r$ is given by

$$P_r = \frac{1}{M} \sum_{n=1}^{M} s^2(n) \tag{2-23}$$

If the power of the synthesized speech is set equal to that of the original speech, the
gain $A_g$ can be determined from the following equation,

$$A_g^2 \sum_{n=1}^{M} g_x^2(n) = M P_r - \sum_{n=1}^{M} s'^2(n) \tag{2-24}$$

where s'(n) is the memory contribution from the previous pitch period. In fact, s'(n) is the
output of the preceding all-pole filter with zero initial state, and it is written as

$$s'(n) = A_g'\, g_x'(n) = A_g' \left[ \sum_{i=1}^{p} a_i'\, g_x'(n-i) + x'(n) \right] = \sum_{i=1}^{p} a_i'\, s'(n-i) + A_g'\, x'(n), \qquad -M' \le n \le M'-1 \tag{2-25}$$

where x'(n), $a_i'$, and $A_g'$ are the excitation signal, the filter coefficients and the
excitation gain of the preceding pitch period, respectively, and M' is the length of the
preceding pitch period. Finally, the synthesized speech is obtained by adding the two
signals, $A_g g_x(n)$ and s'(n),

$$s(n) = A_g\, g_x(n) + s'(n), \qquad 0 \le n \le M-1, \tag{2-26}$$

and the memory contribution to the next excitation is

$$s'(n) = A_g\, g_x(n), \qquad M \le n \le 2M-1. \tag{2-27}$$

[Figure 2-23. Illustration of the gain determination strategy: the synthesized speech for
the pitch period is the current filter output plus the memory contribution from the
previous excitation.]

2.4 Summary

This chapter has focused on the mechanisms for a speech analysis/synthesis
system and the acoustic models. We introduced three submodels that model human
speech production for our system. They are the excitation control model, the excitation
source model and the resonant tract model.

Depending on the classification of voiced/unvoiced sounds, one of two excitations
is created in the excitation source. If the classification is voiced, the control model
determines the length of the pitch period, the starting time of each excitation pulse, as well as
the excitation gain. The shape of the excitation pulse is controlled by the source model. For
unvoiced sounds, the stochastic codebook supplies the excitation pulse. For voiced
sounds, the waveform model generates the excitation pulse according to the source
parameters.

We also addressed two configurations for the resonant tract, which models the
slowly varying frequency response of the vocal tract. The resonant tract filter for unvoiced
sounds is obtained from a 13th order LP analysis. For voiced sounds, the filter is
derived either from LP analysis (LP configuration) or from a polynomial expansion process
(formant configuration) that multiplies 5 second-order polynomials together, where each
second-order polynomial is associated with a specific formant frequency and bandwidth.

After  a  brief  review  of  speech  synthesizers,  we  introduced  a  hybrid  structure  of 
our  system.  It  was  shown  that  either  the  LP  speech  synthesizer  or  the  formant  speech 
synthesizer  can  reproduce  high-quality  synthetic  speech,  and  each  scheme  has  its  own 
drawbacks.  Therefore,  we  adopted  a  flexible  scheme  that  included  both  the  formant 
representation  and  the  LP  representation  for  voiced  sounds.  In  other  words,  there  are  two 
types  of  resonant  tract  configurations  for  our  system.  The  user  can  select  either  a  formant 
or  an  LP  configuration  for  voiced  sounds,  while  the  LP  representation  is  used  for 
unvoiced  sounds.  The  user  can  also  select  either  the  polynomial  model  or  the  transformed 
LF  model  for  the  excitation  source.  In  the  last  section,  we  defined  the  speech  parameters 
and  described  the  analysis  and  synthesis  procedures  in  detail. 


CHAPTER  3 
VOICE  MODIFICATION 


We have described the flexible speech analysis/synthesis system used in this
research in the previous chapter. Speech is synthesized from five types of acoustic features, and
these acoustic features are described by parameters of the corresponding acoustic models.
The resulting synthesized speech is very close to the original speech, and sounds natural
when played through loudspeakers.

Since the acoustic features of the speech signal are highly parameterized, the
characteristics of the synthetic voice can be modified by altering the acoustic parameters.
The purpose of this chapter is to develop algorithms that modify the acoustic parameters
to synthesize a designed voice. In the following sections, we discuss the techniques for
such tasks, as well as the associated problems arising from the implementation.

3.1  Pitch  Contour  Modification 

The pitch period (i.e., the reciprocal of the fundamental frequency) is one of the
important acoustic features for assessing individual differences in voice quality perception
(Krishnamurthy, 1983). It is the cycle length of the glottal fold vibration. Since speech
production is a dynamic event, the pitch period varies over time. In our
speech analyzer, the glottal closure instants (GCIs) of the speech signal are measured and
sorted into a vector according to their timing information, which is denoted as the GCI
sequence. The distance between successive GCIs is the length of the glottal pulse, that is, the
pitch period. For pitch-synchronous synthesis, those instants determine the timing of
generating the glottal pulses and updating the resonant tract parameters.

Our objective is to find a method that is able to alter or modify the GCI sequence
in order to create or mimic various types of voices. Intuitively, the pitch contour can be
modified by altering the GCI sequence. For example, changing the length of the GCI
sequence will make the synthesized speech have a different fundamental frequency.
Furthermore, if we modify the allocation of the GCI sequence, the synthesized speech will
exhibit different intonation patterns. For instance, decreasing the distance between
successive GCIs will make the synthesized speech sound like a rising tone; conversely,
increasing the distance will produce a falling tone. Our algorithms are developed based on
this observation.

3.1.1  Proposed  Pitch  Contour  Model 

In  order  to  have  a  systematic  point  of  view,  the  GCI  sequence  is  transformed  into 
the  pitch  contour,  which  is  defined  as  the  GCI  vs.  its  pitch  period.  Namely,  the  horizontal 
axis  of  the  plot  is  the  GCI  sequence  in  milliseconds,  while  the  vertical  axis  is  the  pitch 
period,  the  milliseconds  between  this  GCI  and  next  GCI.  The  last  GCI  is  removed  from 
the  plot  and  it  can  be  reconstructed  from  the  plot.  The  speech  signal,  "we  were  away  a 
year  ago,"  and  its  corresponding  pitch  contour  are  illustrated  in  Figure  3-1. 
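
The transformation between a GCI sequence and its pitch contour, including the reconstruction of the dropped last GCI, can be sketched as follows. The function names are hypothetical and times are assumed to be in milliseconds.

```python
import numpy as np

def gci_to_pitch_contour(gci_ms):
    """Return (GCI times without the last one, pitch period from each GCI to the next)."""
    gci_ms = np.asarray(gci_ms, dtype=float)
    return gci_ms[:-1], np.diff(gci_ms)

def pitch_contour_to_gci(first_gci_ms, periods_ms):
    """Rebuild the full GCI sequence, including the last GCI, by accumulating the periods."""
    return first_gci_ms + np.concatenate([[0.0], np.cumsum(periods_ms)])
```

For GCIs at 10, 18, 27 and 35 ms the contour has the values 8, 9 and 8 ms at 10, 18 and 27 ms, and accumulating those periods from 10 ms restores all four instants.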

From visual inspection, the pitch contour is thought to consist of three independent
factors, as illustrated in Figure 3-2. The first is the average value of the pitch period, which can
be thought of as a constant over a spoken sentence. It is defined as the fundamental pitch
period. The second is the steady-state or long-time fluctuation of the pitch contour,
which can be thought of as the pattern of the pitch contour. In order to distinguish this factor
from the pitch contour, we define it as the "pitch wave" in this dissertation. The pitch
wave is closely related to the intonation and plays a complex role in encoding information
about the feelings of the speaker in ways the segmental information alone can never do
(Klatt, 1987). The third factor is the short-time perturbation along the pitch contour,
defined as the pitch (period) jitter, which has been observed in clinical studies as a natural
event (Simon, 1927).

[Figure 3-1. The speech signals and their corresponding pitch contours. (a1) One segment
of the speech signal and its glottal closure instants. (a2) The corresponding pitch contour.
(b1) The whole speech signal. (b2) The corresponding pitch contour.]

[Figure 3-2. The proposed unified pitch contour model.]

One advantage of this proposed model is that each factor is closely related to
certain perceptual features and can be independently controlled. For example, the
fundamental pitch period can be shifted without affecting the pitch wave or the pitch
jitter, and vice versa. We will discuss the effectiveness of the model in Chapter 4.

3.1.2  Analysis  Schemes 

The three pitch contour factors can be obtained as follows. The fundamental pitch
period is the mean value of the contour. The pitch wave is estimated by a 5th order
median filter after subtracting the fundamental pitch period from the pitch contour. The
jitter is the standard deviation of the difference between the pitch contour and the
pitch wave.
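
A sketch of this decomposition is given below. The "5th order" median filter is interpreted here as a 5-point sliding median with edge padding, which is an assumption, and the function names are hypothetical.

```python
import numpy as np

def median_filter(x, width=5):
    """Sliding median of odd width; the ends are padded by repeating the edge values."""
    pad = width // 2
    padded = np.pad(x, pad, mode="edge")
    return np.array([np.median(padded[i:i + width]) for i in range(len(x))])

def decompose_pitch_contour(periods_ms):
    """Split a pitch contour into fundamental pitch period, pitch wave, and jitter."""
    periods_ms = np.asarray(periods_ms, dtype=float)
    fundamental = periods_ms.mean()
    wave = median_filter(periods_ms - fundamental, 5)
    jitter = np.std(periods_ms - fundamental - wave)   # spread of what the wave leaves over
    return fundamental, wave, jitter
```

A single outlying period is rejected by the median filter and therefore ends up in the jitter term rather than in the pitch wave.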

Though the fundamental pitch period and pitch jitter have been described by
parameters, the pitch wave has not yet been quantified. For further interpretation or
modification, we would like to classify the pitch wave into pitch "patterns," which
specify the pitch movement over the time axis.

Many phenomenological observations have been collected about pitch motions in
English sentences, and hypotheses have been generated concerning their relations to
linguistic constructs known as intonation and stress (Pike, 1945; Lieberman, 1967). From
the perception point of view, the pitch wave is closely related to the intonation and reveals
the attitudes and feelings of the speaker in ways the segmental information alone can
never do (Pierrehumbert, 1981). However, we are not interested in a quantitative
analysis, which is not suitable for our purpose, since the pitch wave is influenced not
only by intonational phenomena but also by nonlinguistic, physiological and acoustic
factors of speech production. Rather, we try to abstract a regular pattern in the pitch wave
from visual inspection, and describe it qualitatively or approximate it by schematic
drawings.

Several algorithms have been proposed to describe the pitch wave or the
fundamental frequency contour (Mattingly, 1966; Maeda, 1974; Fujisaki, 1983). One
simple way to model the pitch wave is to encode it as three intonational
patterns that represent the "tune" of a clause. The pitch pattern distinguishes a statement
from a question or an imperative, or marks the continuation rise (fall) between clauses for an
utterance of more than one clause. These three patterns, shown in Figure 3-3, are falling,
rising and baseline, which correspond to stress building, stress release and continual
connection of linguistic constructs.

[Figure 3-3. Three types of pitch patterns. (a) Falling tune. (b) Rising tune.
(c) Baseline tune.]

From visual inspection, the pitch pattern can be approximated by a second-order
polynomial function, such as

$$w(x) = a x^2 + b x + c, \tag{3-1}$$

where x takes values from 0 to 1 along the (normalized) time axis of the pattern, and a, b,
and c are the constants to be determined. In order to avoid the problem of discontinuity,
we introduce two constraints,

$$w(0) = p(0), \tag{3-2}$$

$$w(1) = p(T), \tag{3-3}$$

where p(0) and p(T) are the measured values at the beginning and ending points,
respectively. Because of the constraints, only one degree of freedom is available. Using
the least-squares principle, the coefficients can be solved as follows.

$$c = p_1 = p(0), \tag{3-4}$$

$$a + b + c = p_N = p(T), \tag{3-5}$$

[Equation (3-6): least-squares expression for the remaining coefficient, summed over the
N samples of the pattern; not recoverable here.]

Furthermore, the pitch pattern can be classified as

$$\begin{cases} a + b > 0 & \rightarrow \text{rising tune}, \\ a + b < 0 & \rightarrow \text{falling tune}, \\ a = 0 & \rightarrow \text{baseline tune}. \end{cases} \tag{3-7}$$

An example of modeling a pitch wave by our rules is given in Figure 3-4. Note
that the segmentation of the pitch pattern is done by hand from visual inspection.
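
The constrained fit can be sketched as follows. With w(0) = p(0) and w(1) = p(T) imposed, the polynomial can be written as w(x) = p_1 + (p_N - p_1) x + a (x^2 - x), so the single free coefficient a follows from an ordinary one-parameter least-squares problem. This closed form is derived here from the stated constraints; it may differ in notation from the dissertation's own expression, and the function name and the uniform spacing of x are assumptions.

```python
import numpy as np

def fit_pitch_pattern(p):
    """Fit w(x) = a x^2 + b x + c on x in [0, 1] with both end points fixed."""
    p = np.asarray(p, dtype=float)
    x = np.linspace(0.0, 1.0, len(p))      # assumed uniform normalized time axis
    c = p[0]                               # w(0) = p(0)
    slope = p[-1] - p[0]                   # a + b, from w(1) = p(T)
    u = x * x - x                          # basis that vanishes at both end points
    r = p - c - slope * x                  # what the straight line leaves unexplained
    denom = np.dot(u, u)
    a = np.dot(u, r) / denom if denom > 0 else 0.0
    b = slope - a
    return a, b, c
```

Because the basis x^2 - x is zero at x = 0 and x = 1, adjusting a never disturbs the two end-point constraints.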

In  our  studies,  we  also  include  a  third  order  spline  function  to  approximate  the 
pitch  wave.  The  characteristic  of  the  spline  function  is  that  the  function  and  its  first 
derivative  are  continuous  at  the  beginning  and  ending  points.  However,  the  computation 
is  not  as  easy  as  the  second  order  polynomial. 

In  sum,  the  three  factors  of  our  proposed  pitch  contour  model  are  estimated  as 
follows. 

1.  Transform  the  GCI  sequence  into  the  pitch  contour.  Find  its  mean  value  and 
denote  it  as  the  fundamental  pitch  period. 

2. Smooth the pitch contour with a fifth-order median filter and subtract the
fundamental pitch period from it. The resulting contour is the pitch wave.

3. Subtract the pitch wave and the fundamental pitch period from the original pitch
contour. Calculate the standard deviation of the resulting sequence and denote
it as the pitch jitter.

4. Manually segment the pitch wave into pitch patterns. Each pattern is then
approximated by a second-order polynomial or third-order spline function.

[Figure 3-4. An example of modeling the pitch wave.]

3.1.3  Modification  Schemes 

Once we obtain these factors, they can be modified independently as follows.

1. The fundamental pitch period. This value can be scaled up or down by a scalar.
In general, the fundamental pitch period is in the range of 4-12 ms. On average, the
fundamental pitch periods of female speakers are about 0.59 times those of male
speakers (Peterson and Barney, 1952).

2. The pitch wave. The pitch wave is segmented manually into several pitch
patterns, and each pattern is described by a suitable polynomial function. The
pattern can then be modified to a different tune by changing the coefficients (a,
b, c) or by drawing it schematically.

3. The pitch jitter. This factor can be scaled up or down by a scalar. It is
recommended that it stay in the range of 0% to 200% of its measured value.

3.1.4  Synthesis  Schemes 

The  GCI  sequence  for  speech  synthesis  can  be  reconstructed  by  reversing  the 
above  procedures.  In  order  to  avoid  discontinuity,  the  time  span  of  the  reconstructed  GCI 
sequence  should  be  about  the  same  as  the  original,  otherwise  the  excitation  pulse  will  leak 
into  the  unvoiced  region.  The  procedures  for  synthesis  are  listed  below: 

1. Build the basic pitch contour from the fundamental pitch period on the original
GCI time axis. Namely, the vertical value at every GCI is set to the
fundamental pitch period.

2. Introduce the jitter component on the contour.

3. Impose the pitch wave on the above contour. This is the new pitch contour for
synthesis.

4. Reconstruct the GCI sequence from the pitch contour. Start with the first GCI
and its pitch period. Locate the second GCI, which is away from the first GCI by
the pitch period of the first GCI. Continue this process to locate new GCIs until
the new GCI exceeds the voiced speech region.

These schemes are implemented in the graphical user interface software VOCOS,
and the usage of the software is presented in Chapter 5.
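
The four synthesis steps can be sketched as follows. This is not the VOCOS implementation: the function names are hypothetical, the jitter is drawn from a Gaussian distribution with the measured standard deviation, and the pitch period at a newly placed GCI is read from the contour by linear interpolation, all of which are assumptions.

```python
import numpy as np

def build_pitch_contour(n_points, fundamental_ms, wave_ms, jitter_ms, seed=0):
    """Steps 1 to 3: constant fundamental period, plus jitter, plus the pitch wave."""
    rng = np.random.default_rng(seed)
    contour = np.full(n_points, float(fundamental_ms))        # step 1
    contour += jitter_ms * rng.standard_normal(n_points)      # step 2 (assumed Gaussian)
    contour += np.asarray(wave_ms, dtype=float)               # step 3
    return contour

def reconstruct_gci(times_ms, periods_ms, region_end_ms):
    """Step 4: walk forward from the first GCI until the voiced region is exceeded."""
    gci = [float(times_ms[0])]
    while True:
        period = float(np.interp(gci[-1], times_ms, periods_ms))
        if period <= 0:
            raise ValueError("pitch period must be positive")
        nxt = gci[-1] + period
        if nxt > region_end_ms:
            break
        gci.append(nxt)
    return np.array(gci)
```

Stopping at the end of the voiced region keeps the excitation pulses from leaking into the neighboring unvoiced region.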

3.2  Gain  Contour  Modification 

In our analyzer the gain parameter records the average value of the speech energy
for each pitch period, and its function is to control the power transition along an utterance
in speech synthesis. This parameter is closely related to the intensity or sound pressure in
acoustic theory, and it is also directly related to perceived loudness. As the
gain is increased, the synthesized speech is judged by listeners to be louder. Our objective
is to find a method that is able to alter or modify the gain in order to create or mimic
various types of voice.

3.2.1  Analysis  Scheme 

For voiced speech, the gain parameter is pitch synchronous; therefore a gain
contour can be constructed in the same way as the pitch contour. The gain contour is defined as the
gain value plotted against its GCI on the time axis. Note that the speech signal is pre-normalized to
facilitate the automatic analysis process. As a result, the average (gain) value of the
speech signal is not a significant parameter for assessing the speaker's personality in our
system. The gain contour can then be separated into two factors, similar to the pitch
contour. One is the steady-state value of the contour, which can be thought of as the smooth
envelope of the voiced signal in the time domain; it is therefore defined as the gain
envelope. The second one, the pitch-to-pitch variability in gain, is defined as the gain
perturbation of the gain contour. It is closely related to shimmer. These two factors
can be estimated as follows.

1.  Construct  the  gain  contour  from  the  gain  parameter. 

2. Smooth the gain contour by a fifth-order mean filter. The resulting contour is the
gain envelope.

3.  Subtract  the  original  gain  contour  from  the  gain  envelope.  Calculate  the  standard 
deviation  of  the  resulting  sequence  and  denote  it  as  the  gain  perturbation. 
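
A minimal sketch of this three-step analysis is given below. It assumes that the "fifth-order mean filter" is a 5-point moving average and that the ends of the contour are handled by repeating the edge values; both are assumptions of the example, as is the function name `decompose_gain`.

```python
import numpy as np

def decompose_gain(gain, order=5):
    """Split a gain contour into its envelope and a perturbation value.

    gain  : gain values, one per GCI (step 1: the gain contour)
    order : length of the moving-average (mean) filter
    """
    gain = np.asarray(gain, dtype=float)
    pad = order // 2
    # Repeat the edge values so the envelope has the same length as the contour.
    padded = np.pad(gain, pad, mode="edge")
    # Step 2: the smoothed contour is the gain envelope.
    envelope = np.convolve(padded, np.ones(order) / order, mode="valid")
    # Step 3: the standard deviation of the difference is the gain perturbation.
    perturbation = np.std(envelope - gain)
    return envelope, perturbation
```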

Figure 3-5 shows the (voiced) gain contour of a recorded speech signal and its
corresponding factors. The gain envelope is modeled by Equation (3-1) as well.

3.2.2  Modification  Schemes 

The gain factors are modified independently as follows.

1. The gain envelope. The gain envelope is segmented manually into several gain
patterns, and each pattern is then described by a suitable polynomial function.
The pattern can then be modified to a different shape by changing the coefficients
(a, b, c) or by redrawing it schematically.

2. The gain perturbation. This factor can be scaled up or down by a scalar. It is
recommended that it is in the range of 0% to 200% of its measured value.

3.2.3  Synthesis  Schemes 

The  synthesis  scheme  for  the  gain  contour  is  rather  easy.  The  gain  contour  is  the 
sum  of  the  gain  perturbation  and  the  gain  envelope.  Once  the  gain  contour  in  the  voiced 
region  is  determined,  the  gain  parameter  for  each  pitch  period  can  be  determined  by 
interpolation. 

Figure 3-5. An illustration of the gain contour and its factors.

Note that if the pitch contour has been modified, the gain parameter has to be
modified accordingly to avoid discontinuity. The new parameter is constructed by
interpolating the original gain contour at the new GCI sequence.
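
As an illustration of this last step, the sketch below resamples a gain contour at a new GCI sequence. Linear interpolation is an assumption here, since the text does not name the interpolation method, and the numbers are made up for the example.

```python
import numpy as np

def resample_gain(old_gci, old_gain, new_gci):
    """Linearly interpolate the original gain contour at the new GCIs."""
    return np.interp(new_gci, old_gci, old_gain)

# Hypothetical data: four original GCIs (ms) and a shorter new pitch period.
old_gci = np.array([0.0, 10.0, 20.0, 30.0])
old_gain = np.array([1.0, 2.0, 4.0, 3.0])
new_gci = np.array([0.0, 8.0, 16.0, 24.0])
new_gain = resample_gain(old_gci, old_gain, new_gci)
```

Note that `np.interp` holds the end values constant for any new GCI that falls outside the original time span.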

3.3  Resonant  Tract  Modification 

The resonant tract is defined as the transfer function between the volume velocity
of the air flow passing the vocal folds and the sound pressure at the lips of the speaker. As
presented in the previous chapter, the resonant tract is modelled by the filter coefficients,
either an LP polynomial or a formant polynomial, which lack a direct relationship to its
spectral components. However, the radii and angles of the roots of the filter
polynomial are directly related to the bandwidths and frequencies of the spectral peaks,
respectively. Consequently, the spectral representation of the resonant tract lies in the z
domain instead of the polynomial-vector domain. In other words, the first step in
preparing resonant tract modification is to compute the roots of the filter polynomials.

With  formant  roots,  the  formant  frequency  and  bandwidths  can  be  obtained  by 
using  Eq.  (2-1).  The  formant  track  is  then  constructed  by  concatenating  formant 
frequencies  in  time  axis.  Figure  3-6  shows  an  example  of  the  estimated  formant  track  by 
this  method. 

The formant track can then be modified by a scale factor, by mouse-drawing or by
copying from other tracks, and the new filter coefficients are obtained by reversing the
above process. However, constructing the resonant filter directly from the formant
poles sometimes results in a spectrum that deviates from the design. For example, the second
formant may merge into the first formant in the formant spectrum if these two poles are
close enough in the z domain. This is called the pole interaction problem, and a
pole-compensation algorithm is needed to assure that the resulting formant spectrum is
close to the designed one.

Figure 3-6. The formant track of the sentence, "we were away a year ago",
spoken by a male speaker.

3.3.1 Pole Interaction Theory

For a pole $z_i$ with angle $\phi_i$ and radius $r_i$ in the z-domain, its power spectrum is
given by

\[ |H(e^{j\theta})|^2 = \frac{1}{1 - 2 r_i \cos(\theta - \phi_i) + r_i^2} \tag{3-8} \]

where $H(z) = \dfrac{1}{1 - r_i e^{j\phi_i} z^{-1}}$. If the sampling frequency is $F_s$, the corresponding
resonant frequency and bandwidth are determined by Equation (2-1). Here we rewrite it
for convenience,

\[ \text{Formant Frequency} = \frac{\phi_i}{2\pi} F_s \tag{3-9} \]

\[ \text{Formant Bandwidth} = \cos^{-1}\!\left( \frac{4 r_i - 1 - r_i^2}{2 r_i} \right) \frac{F_s}{\pi} \tag{3-10} \]

where the bandwidth is calculated by finding the frequency at which the spectral energy is 3 dB below the
peak.
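
The conversion between a pole and its formant frequency and bandwidth can be sketched as below. The bandwidth expression follows from the 3-dB definition just stated, applied to the single-pole power spectrum; the function names, and the closed-form inverse (which picks the root inside the unit circle), are assumptions of this example rather than the analyzer's actual code.

```python
import numpy as np

def pole_to_formant(radius, angle, fs):
    """Return (frequency, 3-dB bandwidth) in Hz for one pole.

    The arccos argument is only valid for radius >= 3 - 2*sqrt(2) (about 0.17);
    below that the spectrum never falls 3 dB under its peak.
    """
    freq = angle / (2.0 * np.pi) * fs
    cos_half = (4.0 * radius - 1.0 - radius**2) / (2.0 * radius)
    bandwidth = np.arccos(cos_half) / np.pi * fs
    return freq, bandwidth

def formant_to_pole(freq, bandwidth, fs):
    """Inverse mapping: return (radius, angle) of the pole inside the unit circle."""
    angle = 2.0 * np.pi * freq / fs
    c = np.cos(np.pi * bandwidth / fs)
    # Solve r^2 - (4 - 2c) r + 1 = 0 and keep the root with r < 1.
    radius = (2.0 - c) - np.sqrt((2.0 - c) ** 2 - 1.0)
    return radius, angle
```

For poles close to the unit circle the bandwidth computed this way is nearly the same as the familiar approximation $-\ln(r) F_s / \pi$.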

If there are two poles, the power spectrum is formulated as

\[ |H(e^{j\theta})|^2 = \prod_{i=1}^{2} \frac{1}{1 - 2 r_i \cos(\theta - \phi_i) + r_i^2} \tag{3-11} \]


From visual inspection, the amplitudes and the shapes of both spectral peaks at angles $\phi_1$
and $\phi_2$ are different from those of the independent cases, because the new spectrum is the
product of two individual spectra, as illustrated in Figure 3-7. To be specific, the spectral
amplitude at the resonant angle $\phi_i$ is multiplied by

\[ \Delta|H|_{ij} = \frac{1}{1 - 2 r_j \cos\delta + r_j^2} \tag{3-12} \]

where $\delta$ is the angle difference between these two poles, $r_i$ denotes the radius of the
original pole ($z_i$) and $r_j$ denotes that of the added pole ($z_j$). $\Delta|H|_{ij}$ is defined as the pole
interaction factor (PIF) of pole $z_j$ on pole $z_i$, and can be used to measure the interaction
effect. If it is larger than 1, the corresponding spectral peak (at the angle $\phi_i$) is lifted;
otherwise the peak is lowered.

Figure 3-7. Examples of pole interaction.

Furthermore,  a  spectral  valley  is  formed  in  the  region  between  the  poles  such  that 
the  spectral  slope  varies  more  rapidly  in  this  region  than  elsewhere.  Sometimes,  there  is 
only  one  peak  appearing  in  the  spectrum  if  these  two  poles  are  close  enough.  This  is 
called  the  pole  interaction  problem.  Note  that  equations  (3-11)  and  (3-12)  can  be 
extended  to  include  more  poles. 
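
The effect can be reproduced numerically with a short sketch. The pole radii and frequencies below are made-up values, only poles with positive angles are modelled (their conjugates are ignored for brevity), and the helper names are hypothetical.

```python
import numpy as np

def pole_power(radius, angle, theta):
    """Power spectrum of a single pole evaluated at angle(s) theta."""
    return 1.0 / (1.0 - 2.0 * radius * np.cos(theta - angle) + radius**2)

def pif(r_other, delta):
    """Pole interaction factor of a pole with radius r_other at angular distance delta."""
    return 1.0 / (1.0 - 2.0 * r_other * np.cos(delta) + r_other**2)

def count_peaks(spectrum):
    """Number of strict interior local maxima of a sampled spectrum."""
    mid = spectrum[1:-1]
    return int(np.sum((mid > spectrum[:-2]) & (mid > spectrum[2:])))

fs = 10000.0
theta = np.linspace(0.0, np.pi, 2001)

def angle_of(freq_hz):
    return 2.0 * np.pi * freq_hz / fs

# Two well-separated poles keep two peaks; two close poles merge into one.
far = pole_power(0.9, angle_of(500), theta) * pole_power(0.9, angle_of(1500), theta)
near = pole_power(0.9, angle_of(500), theta) * pole_power(0.9, angle_of(600), theta)
peaks_far, peaks_near = count_peaks(far), count_peaks(near)
```

With these values a nearby pole gives a PIF well above 1 (the peak is lifted), while a pole on the opposite side of the unit circle gives a PIF below 1.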

3.3.2  Formant  Polynomial  Modification  Algorithm 

For  our  application,  the  resonant  filter  is  constructed  from  a  polynomial  with 
appropriate  poles  to  simulate  the  vocal  tract  response.  As  pole  interaction  theory  states, 
shifting  a  pole  in  the  frequency  axis  will  affect  the  spectral  components  not  only  at  the 
immediate  frequency  but  also  at  other  frequencies,  especially  the  nearby  frequencies. 
Therefore  the  resulting  spectral  peaks  may  not  appear  as  designed.  To  remedy  the 
problem,  we  propose  a  formant  polynomial  modification  algorithm  as  follows. 

From previous studies, the formant energy is more important than the formant
bandwidths in speech perception (Kuwabara and Ohgushi, 1987). Therefore, one
method to relieve the pole interaction problem is to modify the formant bandwidths such
that the corresponding formant energy (peak) has the same level as designed. Namely, the
radii of the formant poles are modified so that the spectral energy of the resulting
formant polynomial equals that of the individual first-order LP polynomial at the
formant frequency.

Using Equation (3-8) and Equation (3-11), the frequency response at the angle $\phi_i$
can be written as

\[ \frac{1}{(1 - r_i')^2} \cdot \frac{1}{1 - 2 r_j \cos(\phi_i - \phi_j) + r_j^2} = \frac{1}{(1 - r_i)^2} \tag{3-13} \]

where $r_i$ is the radius of the pole being adjusted, $r_j$ is the radius of the other pole, and $\phi_i$ and $\phi_j$ are
their corresponding angles, respectively. $r_i'$ is the new radius for this pole such that
the spectral energy is equal to its independent frequency response at the angle $\phi_i$. If $r_i'$
exceeds 1, its reciprocal value is used.

In practice, there are more than two poles located in the z domain and their effects
should also be taken into account. Equation (3-13) is extended to include these
additional contributions and it is formulated as

\[ \frac{1}{(1 - r_i')^2} \prod_{j=1,\, j \ne i}^{N} \frac{1}{1 - 2 r_j \cos(\phi_i - \phi_j) + r_j^2} = \frac{1}{(1 - r_i)^2} \tag{3-14} \]

where $r_i$ is the radius of the pole being adjusted, $r_j$ is the radius of another pole, $\phi_i$ and $\phi_j$ are
their corresponding angles, respectively, and N is the total number of poles. The new
radius of the adjusted pole, $r_i'$, can be calculated from Equation (3-14) such that the
spectral energy at the angle $\phi_i$ is the same as when the pole is independent.

The  computational  strategy  of  applying  this  modification  to  the  formant 
polynomial  is  explained  as  follows.  The  formant  frequencies  and  bandwidths  are 
transformed  into  formant  poles  by  using  Equation  (3-9)  and  (3-10).  Since  the  formant 
roots  are  complex-conjugate  pairs,  only  those  with  a  positive  angle  need  to  be  modified 
and  their  conjugated  parts  are  obtained  explicitly  at  the  final  stage.  The  formant  pole 
modification  starts  with  the  root  whose  angle  is  smallest  and  its  new  radius  is  obtained  by 
using Equation (3-14). The process continues until all roots are modified. If necessary,
this process can be repeated with the new radii. The final formant polynomial is
then constructed using the modified poles and their complex conjugates.
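
A rough sketch of this computational strategy is given below. Several choices are assumptions of the example rather than details stated in the text: only positive-angle poles enter the product, each newly adjusted radius is used immediately when the next pole is processed, the root of Equation (3-14) that stays inside the unit circle is taken (so the reciprocal step is not needed), and a radius that would become negative is clamped to zero.

```python
import numpy as np

def pif(r_other, delta):
    """Pole interaction factor of a pole with radius r_other at angular distance delta."""
    return 1.0 / (1.0 - 2.0 * r_other * np.cos(delta) + r_other**2)

def adjust_radii(radii, angles, passes=1):
    """Adjust pole radii so each peak keeps its independent (designed) level.

    radii, angles : designed radii and angles of the positive-angle formant poles
    passes        : number of sweeps over all poles
    """
    radii = np.array(radii, dtype=float)     # working copy, updated in place
    angles = np.asarray(angles, dtype=float)
    design = radii.copy()                    # designed radii define the target level
    for _ in range(passes):
        for i in np.argsort(angles):         # start with the smallest angle
            gain = 1.0
            for j in range(len(radii)):
                if j != i:
                    gain *= pif(radii[j], angles[i] - angles[j])
            # Solve (1 - r')^2 = gain * (1 - r_i)^2 for the root with r' < 1.
            new_r = 1.0 - (1.0 - design[i]) * np.sqrt(gain)
            radii[i] = max(new_r, 0.0)       # clamp: an assumption of this sketch
    return radii
```

When a neighbouring pole lifts a peak (PIF above 1), the adjusted radius moves away from the unit circle, which widens the bandwidth and brings the peak back to its designed level.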

In short, a formant polynomial modification algorithm is proposed here to reduce
the degradation caused by pole interaction by changing the radial distance of the formant root.
The formant spectrum obtained with this modification is closer to the designed spectrum
than that without modification. An example demonstrating the advantage of the proposed
algorithm is given in Figure 3-8. In the plot, the second formant peak is separated
from the first formant peak by the proposed algorithm, whereas the direct
construction fails to separate them.

3.4 Glottal Pulse Modification

The  shape  of  the  glottal  excitation  pulse  has  been  shown  to  greatly  affect  the 
quality  and  naturalness  of  synthesized  speech  (Rosenberg  1971;  Naik,  1984).  In  this 
research,  we  use  two  types  of  excitation  source  model  to  generate  the  glottal  pulse.  As  a 
consequence,  we  adopt  two  methods  to  modify  the  source  parameters. 

For the polynomial model, there is little physical relationship between the
parameters and the glottal pulse shape. In other words, it is difficult to "create" a new
pulse shape by manipulating the polynomial coefficients from scratch. Even so, one
way to modify the polynomial parameters is to "copy" the characteristics from data
analysis. For example, suppose we have collected the excitation source parameters for
two speakers, the target speaker and the source speaker. A linear mapping function
between the polynomial parameters of these two speakers can be determined by the linear
regression method. The glottal source parameters of the source speaker are then
modified to match those of the target speaker by the derived linear mapping function.

For the transformed LF model, the source parameters can be modified either by
the above linear regression method or through the graphical user interface. Since the shape of the
glottal pulse is only implicitly described by the model parameters, it is recommended that the
source parameters be transformed into the LF timing parameters (tp, te, ta and tc) in
preparation for the modification.

One advantage of the LF model is that it has been used to study how the vocal
quality is affected by the shape of the glottal pulse (Childers and Lee, 1991; Childers and
Ahn 1994; Shue, 1995). Previous research showed that there was a difference in three
of the four LF timing parameters among three voice types: modal, vocal fry and breathy.
That is, the voice quality can be modified simply by changing the LF timing parameters.
The details will be discussed in Chapter 5.

Figure 3-8. An example of formant modification: the second
formant is shifted from 972 Hz to 705 Hz. Panel (c): modified spectrum by the proposed algorithm.

3.5  Summary 

In this chapter we have introduced the algorithms for modifying the parameters of
our acoustic model to simulate voice feature conversion in the acoustic domain. Since
the sets of acoustic parameters were assumed to be independent, the modification
approaches were developed separately. A pitch contour model was proposed to describe and
control the pitch contour. In that model, the pitch contour is decomposed into the
fundamental pitch period, the pitch wave and the pitch jitter. Each of the components is
closely related to certain perceptual features and can be independently controlled. We
also addressed the analysis, modification and synthesis schemes for that model. A similar
approach was adopted for the gain contour modification.

Because of the pole interaction problem, we introduced a formant polynomial
modification algorithm to construct the formant filter from the modified formant poles.
The radii of the formant poles are adjusted to make the spectral energy of the new
(modified) formant peaks close to that of the previous (pre-modified) formant peaks. We
have shown that the formant spectrum constructed by the proposed algorithm has more
distinct peaks than that obtained by the direct construction. The method for modifying the source
parameters was addressed in the final section.


CHAPTER  4 
VOICE  CONVERSION 


Our approach to converting the voice of one speaker to that of another speaker is presented
in this chapter. The objectives are to (1) develop methods for creating new synthetic voices,
(2) study factors responsible for synthetic voice quality, and (3) determine methods for
speaker adaptation. This research is one model for studying factors responsible for the
quality of synthetic speech, for mimicking voices, and for speaker normalization.

The organization of this chapter is as follows. In the first section, we introduce
two adaptation models to describe the differences between two sets of parameters. With
these models, voice conversion can be realized by converting the acoustic parameters
through the mapping functions. Our voice conversion algorithms, which are developed on
this parameter transformation platform, are described in the second section. The
third section describes several experiments to test the performance of our voice
conversion algorithms. The chapter concludes with a discussion of the experimental results.

4.1  Speaker  Adaptation  Models 

4.1.1  Translation  Model 

A  general  approach  to  voice  conversion  or  speaker  adaptation  is  to  treat  the 
speaker  differences  as  arising  from  a  parametric  transformation.  The  voice  conversion 
task  is  then  simplified  as  a  mapping  between  the  two  sets  of  parameters. 

Suppose that a sound (or phoneme) can be represented by an n-dimensional
acoustic feature vector and that an estimate of the mean vector of this sound over various
speakers is $\mu$. The sound for the speaker i is denoted as a vector $s^i$ and given by

\[ s^i = \mu + \delta \tag{4-1} \]

where the vector $\delta$ may be thought of as a "bias" term, which is the characteristic of the
speaker, or more correctly the speaker plus acoustic channel. Therefore, this linear model
is called the bias model by Cox (1995).

Based on this idea, voice conversion can be realized by converting the parameters
in the acoustic feature space. We hypothesize that the acoustic features are linearly
independent of one another and that the transformation is time invariant. Thus, one can
convert the acoustic parameters of one speaker to those of another speaker, if the offset
value between two specific speakers is known. In other words,

\[ X = Y + B \tag{4-2} \]

where X and Y are the n-dimensional acoustic parameter vectors for the target and source
speaker, respectively, B is the offset vector, and n is the number of measured acoustic
features. In linear algebra, this is a standard translation between two vectors (Birkhoff
and Mac Lane, 1965). Therefore, Eq. (4-2) is called the translation model.

Our task is to estimate B, which is thought of as the difference between two speakers
plus the same channel effect. For simplicity, we can write Eq. (4-2) as the following
component-wise linear equations.

\[ \begin{aligned} x_1 &= y_1 + b_1 \\ x_2 &= y_2 + b_2 \\ &\;\;\vdots \\ x_n &= y_n + b_n \end{aligned} \tag{4-3} \]

where $x_i$ and $y_i$ are the ith acoustic feature for the target and source speaker, respectively,
and $b_i$ is the corresponding offset scalar. If m samples are collected for the two speakers,
the value of $b_i$ can be estimated by using the least squares principle. That is,

\[ b_i = \frac{1}{m} \sum_{k=1}^{m} (x_{ik} - y_{ik}), \qquad i = 1, \ldots, n \tag{4-4} \]

where $x_{ik}$ is the kth sample of the acoustic feature $x_i$, and $y_{ik}$ is the kth sample of the
acoustic feature $y_i$.
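
A minimal sketch of the translation model follows. It assumes the target and source samples are already time-aligned and stored as arrays of shape (m, n); the function names are hypothetical. The least-squares offset is simply the mean difference for each feature.

```python
import numpy as np

def fit_translation(x, y):
    """Estimate the offset vector B from m aligned samples of n features.

    x : target samples, shape (m, n)
    y : source samples, shape (m, n)
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean(x - y, axis=0)        # one offset b_i per feature

def apply_translation(y, b):
    """Convert source parameters with the translation model X = Y + B."""
    return np.asarray(y, dtype=float) + b
```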

4.1.2  Affine  Model 

Although the translation model has the advantage of simplicity, its core
assumption, that a given speaker's speech can be modeled as a single invariant
transformation applied to the other speaker's speech, may not be sufficient to
account for the complexity of the speech signal in the acoustic feature space. A more
sophisticated model is addressed in this section.

As stated in section 1.2.2, the factors for determining the voice characteristics may
be modeled as a linear function. The acoustic features of one speaker may be modeled as
a linear combination of another speaker's features. That is,

\[ X = AY + B \tag{4-5} \]

where X and Y are the n-dimensional acoustic feature vectors for the target and source
speaker, respectively, A is an n-by-n matrix, and B is an n-dimensional vector. This is
called the affine transformation between vectors X and Y (Birkhoff and Mac Lane, 1965).
As before, we hypothesize that the acoustic features are linearly independent of one
another and that the transformation is time invariant. As a result, A becomes a diagonal
matrix. In fact, Eq. (4-2) is a special case of Eq. (4-5) when A is equal to the identity matrix.

In our research, A and B are estimated by the linear regression method from the
speech samples of two specific speakers. With this model, voice conversion can then be
realized by Eq. (4-5), once A and B are determined.
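
Because A is diagonal, each feature can be fitted by an ordinary one-variable linear regression. The sketch below illustrates this under the assumption that the samples are time-aligned arrays of shape (m, n); the function name is hypothetical and the code is not the regression routine used in our system.

```python
import numpy as np

def fit_affine_diagonal(x, y):
    """Fit x_i = a_i * y_i + b_i independently for each feature.

    x : target samples, shape (m, n)
    y : source samples, shape (m, n)
    Returns the diagonal of A and the vector B.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(axis=0), y.mean(axis=0)
    # Least-squares slope: covariance of (y, x) over the variance of y.
    a = ((y - y_mean) * (x - x_mean)).sum(axis=0) / ((y - y_mean) ** 2).sum(axis=0)
    b = x_mean - a * y_mean
    return a, b
```

Converted parameters are then obtained as `a * y + b`, frame by frame. Setting every `a_i` to 1 reduces this to the translation model.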

4.1.3 Training Process

The process for finding the mapping function between the target and source
vectors is called the "training" process. Because of variations in speaking rate, we use a
dynamic time warping (DTW) technique to adjust the parameters of the source to be in
accord with those of the target on the time axis. A diagram of the training process is shown
in Figure 4-1. The implementation of the DTW algorithm is given in the Appendix.

The  training  process  is  described  as  follows.  Two  speakers,  the  source  speaker 
and  the  target  speaker,  pronounce  the  same  sentences  and  the  acoustic  parameters  are 
extracted  from  these  two  signals  via  the  analysis  process.  Each  set  of  speech  parameters 
forms  a  frame-based  vector.  The  source  vectors  are  then  time-aligned  with  the 
corresponding  target  vectors  by  the  DTW  algorithm.  Finally,  we  use  the  linear 
multivariable  regression  (LMR)  algorithm  to  estimate  the  coefficients  of  the  mapping 
function,  either  the  translation  model  or  the  affine  model. 
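
The time-alignment step can be illustrated with a basic DTW sketch. Note the substitutions: it uses a Euclidean frame distance and the simplest three-direction local path, not the Itakura local constraint and distortion measure used in our experiments, and it is not the implementation given in the Appendix.

```python
import numpy as np

def dtw_path(source, target):
    """Align two frame sequences and return (total_cost, path).

    source, target : sequences of frames; each frame is a scalar or a vector
    path           : list of (source_index, target_index) pairs
    """
    s = np.asarray(source, dtype=float).reshape(len(source), -1)
    t = np.asarray(target, dtype=float).reshape(len(target), -1)
    n, m = len(s), len(t)

    # cost[i, j] = best accumulated distance aligning s[:i] with t[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = np.linalg.norm(s[i - 1] - t[j - 1])
            cost[i, j] = dist + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])

    # Trace the warping path back from the last frame pair to the first.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min((cost[i - 1, j - 1], i - 1, j - 1),
                      (cost[i - 1, j], i - 1, j),
                      (cost[i, j - 1], i, j - 1))
        path.append((i - 1, j - 1))
    return cost[n, m], path[::-1]
```

The returned path tells which source frame is paired with each target frame; the paired frames are the samples fed to the regression.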

4.2  Voice  Conversion  Algorithms 

The voice conversion algorithms used to convert the speech of one speaker to
sound like that of another speaker are described in this section. As discussed in section
1.2, three factors of the speech production model are responsible for characterizing voice: the
dimensions of the vocal tract, the vibratory patterns of the vocal folds and the articulatory
style. However, it is difficult to simulate voice variations caused by all three factors. Our
research has focused on the modification of the segmental parameters of the speech
signal. In particular, we are interested in modifying five measured acoustic features so that
the voice conversion process is simulated by a parameter mapping process. Figure 4-2 shows
the block diagram of our voice conversion process.

Figure 4-1. The training process diagram.

4.2.1  Strategy 

In  our  speech  production  model,  the  speech  signal  consists  of  five  types  of 
acoustic  features:  the  voicetype  classification,  the  pitch  contour,  the  gain  contour,  the 
vocal  tract  resonances  and  the  glottal  pulse  shape.  Furthermore,  the  features  are  all 
represented  by  parameters  that  can  be  estimated  via  the  analysis  process.  As  described  in 
Chapter  2,  the  speech  synthesized  with  these  parameters  closely  approximates  the  original 
speech. Therefore, we hypothesize that voice conversion can be simulated by a
parameter transformation in which the acoustic features of one utterance (the source) are
made to match those of the desired one (the target).

There are four types of mapping methods used in our studies: the translation
transformation, the affine transformation, the "copy" method and the "retain" method.
Since we hypothesize that the acoustic features are independent of one another, the
acoustic features are converted separately. For example, we may use the translation (bias) model for
the pitch contour, the affine (linear) model for the gain contour, and the copy method for the
glottal pulse and the formant frequency. In fact, we can use 256 different combinations
(four methods for each of the four voiced features: pitch, gain, glottal pulse and formants) to
study the factors responsible for the quality of synthetic speech.

The  translation  transformation  and  the  affine  transformation  are  based  on  the 
speaker  adaptation  models  addressed  in  the  previous  section.  The  coefficients  of  the 
mapping  function  are  determined  by  the  linear  multivariable  regression  algorithm.  In 
order  to  reduce  the  matching  error  from  DTW,  we  use  a  two-phase  strategy  to  compute 
the  coefficients  of  the  mapping  function.  The  preliminary  coefficients  are  determined 
from  the  entire  data  in  the  first  phase.  Using  this  result,  we  find  the  10%  most  deviated 
data  and  delete  them  from  the  entire  data  set.  The  final  coefficients  are  then  determined 
from  the  remaining  data  in  the  second  phase. 
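
The two-phase strategy can be sketched for a single feature as follows. The use of `np.polyfit` for the line fit and the absolute residual as the measure of deviation are assumptions of this example, as is the function name.

```python
import numpy as np

def two_phase_fit(x, y, drop_fraction=0.10):
    """Fit x = a*y + b, discard the most deviated samples, and fit again.

    x, y          : aligned target and source samples of one feature
    drop_fraction : share of samples removed after the first phase
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Phase 1: preliminary coefficients from the entire data set.
    a, b = np.polyfit(y, x, 1)
    residual = np.abs(x - (a * y + b))

    # Remove the most deviated samples (10% by default).
    n_keep = x.size - int(round(drop_fraction * x.size))
    keep = np.argsort(residual)[:n_keep]

    # Phase 2: final coefficients from the remaining data.
    a, b = np.polyfit(y[keep], x[keep], 1)
    return a, b
```

Frames that DTW has paired badly tend to produce large residuals in the first phase, so removing them keeps a few mismatched pairs from biasing the final mapping.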

Figure 4-2. Block diagram of our voice conversion process.

The third method, the "copy" method, simply uses the acoustic features of the
target speaker to synthesize the converted speech. It establishes a comparison base to
study the effectiveness of the first two conversion methods, as well as the factors responsible
for the quality of synthetic speech. The fourth method is to "retain" the source parameters
so that the synthesized speech contains certain features of the source speech. However,
the source parameters are associated with the warping path in accord with the target's
speaking rate.

We  are  more  interested  in  studying  the  effectiveness  of  the  first  two  methods  on 
converting  voice,  while  the  other  two  methods  build  a  comparison  base  to  study  the 
effects  of  acoustic  features  on  the  "personality"  of  the  synthesized  speech. 

After  determining  the  mapping  functions,  the  converted  parameters  are  obtained 
by  feeding  the  source  parameters  into  the  functions.  The  converted  speech  is  generated  by 
synthesizing  the  converted  parameters. 

4.2.2  Realization 

Although the acoustic features are described by parameters, it is not
straightforward to synthesize speech from the converted parameters. The modified parameters have to be
consistent with the speech production model in order to produce high quality speech. The
modification details are discussed in Chapter 3. Here we present an overview of the
conversion algorithms for each acoustic feature as follows:

1.  Voicetype  conversion.  In  our  system,  there  are  two  types  of  excitation,  voiced 
and  unvoiced.  Since  we  are  concerned  with  voice  conversion  for  voiced  speech 
only,  we  simply  use  the  voicetype  classifications  of  the  target  speech  for  the 
converted  speech.  To  be  specific,  the  voiced  parameters  (pitch,  gain,  glottal  pulse 
and  formants)  are  transformed,  while  the  unvoiced  parameters  (gain,  stochastic 
codeword,  LP  coefficients)  are  copied  from  the  target  to  eliminate  the  unwanted 
noise  in  the  unvoiced  segments. 

2. Pitch contour conversion. For transformation, we use the average value of the
pitch period of each frame for the two tokens (source and target). After the training
process, the source vector is converted into a new value using a specified
mapping function. Since our synthesizer is pitch synchronous, the pitch vector is
transformed into the timing instants that correspond to the glottal closure
instants (GCI). In order to eliminate the unvoiced-to-voiced transition noise, the
first GCI is fixed at the beginning point of the voiced segment. For the
voiced-to-unvoiced transition, the last GCI is extended into the next unvoiced
segment. In the voiced/unvoiced transition region, the voiced speech is
overlapped and added with the unvoiced speech.

3. Gain contour conversion. The gain parameter controls the excitation energy for
each pitch period and is pitch synchronous. This implies that the frame-based
gain information has to be interpolated for each GCI after transformation.
Furthermore, once the pitch contour is modified, the gain parameter has to be
re-interpolated, even though the gain contour itself is not changed. In order to
eliminate the discontinuity at the unvoiced-to-voiced transition, the gain
value for the first voiced pulse is linearly interpolated between the last unvoiced
gain and the second voiced gain, and vice versa.

4. Glottal pulse conversion. The LF timing parameters (tp, te, tc, ta) are subject to
certain constraints; for example, te must be larger than tp. If the converted
parameters violate the constraints, our synthesizer uses default values,
which are either the previous LF timing parameters or the average values of the
entire speech record.

5. Formant frequency transformation. Due to the pole interaction problem,
constructing formant polynomials directly from the converted formant
frequencies and bandwidths may not result in the desired formant spectrum. In
our transformation process, only the formant frequencies are converted by the
linear mapping function, while the bandwidths are determined by our formant
modification algorithm to maintain the formant structure (see Section 3.3.2 for
details). Unlike other spectrum transformation methods, our transformation
algorithm is independent of the pitch contour and the speaking rate (Childers et
al., 1989).
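
As one concrete illustration from the list above, the fallback rule for the LF timing parameters (item 4) might be sketched as follows. Only the constraint te > tp is stated in the text; the remaining checks (positive values and te < tc), the `LF` container and the function names are assumptions of this example, and the sketch is not the synthesizer's actual logic.

```python
from typing import List, NamedTuple

class LF(NamedTuple):
    """LF timing parameters of one glottal pulse."""
    tp: float
    te: float
    ta: float
    tc: float

def lf_is_valid(p: LF) -> bool:
    # te > tp is stated in the text; the other checks are assumed for illustration.
    return 0.0 < p.tp < p.te < p.tc and p.ta > 0.0

def constrain_lf(frames: List[LF], average: LF) -> List[LF]:
    """Replace invalid converted parameters by the previous valid set.

    `average` (the mean over the speech record) is used until a valid set is seen.
    """
    fixed = []
    fallback = average
    for p in frames:
        if lf_is_valid(p):
            fallback = p
        fixed.append(fallback)
    return fixed
```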

4.3  Experiments 

4.3.1  Overview 

Several  experiments  were  conducted  to  test  the  performance  of  our  voice 
conversion  algorithms.  Table  4-1  lists  the  experimental  settings  for  speech  analysis, 
conversion  and  synthesis.  Speech  tokens  for  the  same  sentence  were  spoken  by  five  males 
and  two  females,  including  two  voices  with  vocal  disorders.  Two  words  and  one  sentence 
were  used  in  this  research,  "you",  "veal",  and  "We  were  away  a  year  ago." 

An analysis was performed on each utterance spoken by each speaker to obtain
five acoustic features: the voicetype classification, the pitch contour, the gain
contour, the vocal tract resonances and the glottal pulse shape. A standard DTW
algorithm was used to compensate for the speaking rate difference between the
source and target utterances. Specifically, the Itakura local constraint and the
Itakura distortion measure were used in our studies to find the optimal warping
path. The parameters of the source were associated with those of the target
according to the warping path. As illustrated in Table 4-1, four types of
conversion methods were used in the experiments. Once the conversion method was
selected, the coefficients of the mapping functions for the cluster were
determined by the LMR algorithm. The converted parameters were then calculated.
Finally, the converted speech was synthesized using the converted parameters.
All the experiments were conducted with our voice conversion software system,
VOCOS, which is discussed in the next chapter.
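
To illustrate how a warping path associates source frames with target frames, the sketch below implements a basic DTW in Python. It is a simplified stand-in rather than the procedure used in the experiments: it substitutes a symmetric step pattern for the Itakura local constraint and the Euclidean distance for the Itakura distortion measure. The function name `dtw_path` and the list-of-frames data layout are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def dtw_path(source: Sequence[Sequence[float]],
             target: Sequence[Sequence[float]]) -> List[Tuple[int, int]]:
    """Return the minimum-cost alignment as (source_frame, target_frame) pairs.

    Each frame is a feature vector. Allowed steps are diagonal, vertical and
    horizontal (a plain symmetric pattern, NOT the Itakura constraint).
    """
    n, m = len(source), len(target)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")

    # cost[i][j] = best accumulated distance aligning the first i source
    # frames with the first j target frames.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])

    # Backtrack from the end of both sequences to the first frame pair.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = ((i - 1, j - 1), (i - 1, j), (i, j - 1))
        i, j = min(candidates, key=lambda s: cost[s[0]][s[1]])
        path.append((i - 1, j - 1))
    path.reverse()
    return path


if __name__ == "__main__":
    src = [[0.0], [1.0], [2.0], [3.0]]
    tgt = [[0.0], [1.0], [1.0], [2.0], [3.0]]
    print(dtw_path(src, tgt))
```

Each pair in the returned path names one source frame and the target frame it is associated with, which is the pairing the mapping functions are later fitted on.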

Table 4-1. The parameter settings for our experiments.

Analysis: pitch asynchronous, fixed frame with rectangular window
  1. sampling frequency      10 kHz
  2. LPC order               13
  3. frame length            25 ms
  4. overlap length          5 ms
  5. source model            (a) polynomial model
                             (b) LF model
  6. tract model             (a) linear prediction model
                             (b) formant model (5)

Conversion: use DTW techniques to adjust the frame-based parameters
  1. local constraints       Itakura
  2. distortion measure      Itakura
  3. speaking rate           (a) target speaker
                             (b) source speaker

Mapping methods
  parameters                 1. pitch contour
                             2. gain contour
                             3. glottal pulse
                             4. formant frequency
  method for each parameter  (a) translation
                             (b) affine
                             (c) copy
                             (d) retain

Synthesis: pitch synchronous; overlap-and-add
  1. unvoiced frame          LP synthesizer
  2. voiced frame            (a) LP synthesizer
                             (b) formant synthesizer
  3. transition frame        50% overlap

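
As a worked example of the analysis settings in Table 4-1, the following Python sketch converts the frame length and overlap length into sample counts and lists the resulting fixed-frame boundaries. It assumes that "overlap length" means adjacent frames share 5 ms of signal, so the frame advance is 20 ms. The function name and the half-open (start, end) convention are illustrative choices.

```python
from typing import List, Tuple


def frame_bounds(num_samples: int,
                 fs_hz: int = 10_000,
                 frame_ms: float = 25.0,
                 overlap_ms: float = 5.0) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of each full analysis frame.

    Assumption: adjacent frames overlap by `overlap_ms`, so the hop between
    frame starts is frame length minus overlap.
    """
    frame_len = round(fs_hz * frame_ms / 1000)   # 250 samples at 10 kHz
    overlap = round(fs_hz * overlap_ms / 1000)   # 50 samples at 10 kHz
    hop = frame_len - overlap                    # 200 samples
    if hop <= 0:
        raise ValueError("overlap must be shorter than the frame")
    return [(start, start + frame_len)
            for start in range(0, num_samples - frame_len + 1, hop)]


if __name__ == "__main__":
    bounds = frame_bounds(10_000)  # one second of speech at 10 kHz
    print(len(bounds), bounds[:2], bounds[-1])
```

With a rectangular window no weighting is applied inside a frame, so the boundaries alone determine what each frame sees.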
4.3.2 Experiment 1

The test speech token of this experiment was the word "you," spoken by a male
and a female. The objective was to convert the male's speech to sound like the
female's.

An analysis was performed on each utterance spoken by each speaker. We
selected the LF model and the formant model as the source model and the vocal
tract model of our speech synthesizer, respectively. Five types of acoustic
features were measured for each utterance. After the analysis phase, the optimal
warping path between the target and source signals was determined by DTW, as
shown in Figure 4-3. The next step was to associate the analyzed data according
to the warping path and determine the transformation function. The affine
transformation method was selected to convert four acoustic features of the
source speech to those of the target. The linear mapping function for each
parameter is plotted in Figure 4-4, where the horizontal axis represents the
source's value and the vertical axis the target's value. The source parameters
were then transformed according to the mapping function, as illustrated in
Figures 4-5 and 4-6. One major finding is that a spoken sentence has more
dynamic changes than our linear mapping methods can capture, especially for the
formant frequency transformation. The converted speech was synthesized from the
converted parameters. The speech waveforms and spectrograms of the source
speech, the converted speech and the target speech are shown in Figure 4-7 and
Figure 4-8, respectively.

Figure 4-3. The warping path between the source and target speech.

Figure 4-4. The LMR functions for each set of acoustic parameters.

Figure 4-5. The pitch contours and the gain contours of the source speech, the
converted speech and the target speech.

Figure 4-6. The formant tracks of the source speech, the converted speech and
the target speech.

Figure 4-7. The speech waveforms of the source, the converted and the target
speech.

Figure 4-8. The spectrograms of the source, the converted and the target
speech.

In order to validate the mapping functions, we calculated the sum of squares for
error (SSE) of the ith acoustic feature and divided it by (m-2) degrees of
freedom, where m was the number of measurements (Mendenhall, 1971). The result
was the variance of the error, $s_i^2$. Using Eq. (4-5), we put the above
statements into mathematical equations,

$$\mathrm{SSE}_i = \sum_{k=1}^{m} \left( X_{ik} - a_i \cdot Y_{ik} - b_i \right)^2 \qquad (4\text{-}6)$$

$$s_i^2 = \frac{\mathrm{SSE}_i}{m-2} \qquad (4\text{-}7)$$

where $a_i$ is the corresponding scaling scalar, $b_i$ is the corresponding
offset scalar, and $X_i$ and $Y_i$ are the ith acoustic feature for the target
and source speaker, respectively, with $X_{ik}$ and $Y_{ik}$ denoting their kth
measurements.
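
The validation step can be illustrated with a short Python sketch that fits the scaling and offset scalars by ordinary least squares and then evaluates the sum of squares for error and the error variance with m - 2 degrees of freedom, where m is the number of measurements. The function names and the sample data are hypothetical, and ordinary least squares is assumed here as the fitting criterion for a single feature.

```python
from statistics import fmean
from typing import Sequence, Tuple


def fit_affine(source: Sequence[float],
               target: Sequence[float]) -> Tuple[float, float]:
    """Least-squares (a, b) such that target is approximated by a*source + b."""
    mean_y, mean_x = fmean(source), fmean(target)
    syy = sum((y - mean_y) ** 2 for y in source)
    if syy == 0:
        raise ValueError("source values are constant; slope is undefined")
    sxy = sum((y - mean_y) * (x - mean_x) for y, x in zip(source, target))
    a = sxy / syy
    b = mean_x - a * mean_y
    return a, b


def error_variance(source: Sequence[float], target: Sequence[float],
                   a: float, b: float) -> Tuple[float, float]:
    """Return (SSE, SSE / (m - 2)) for the mapping target = a*source + b."""
    m = len(source)
    if m <= 2:
        raise ValueError("need more than two measurements")
    sse = sum((x - a * y - b) ** 2 for y, x in zip(source, target))
    return sse, sse / (m - 2)


if __name__ == "__main__":
    y = [1.0, 2.0, 3.0, 4.0]      # illustrative source measurements
    x = [3.1, 4.9, 7.2, 8.8]      # illustrative aligned target measurements
    a, b = fit_affine(y, x)
    sse, var = error_variance(y, x, a, b)
    print(a, b, sse, var)
```

Two degrees of freedom are subtracted because two coefficients, the scaling and the offset, are estimated from the same data.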

Since SSE_i is dependent on the magnitude of the measured data, we further
normalize the standard deviation of the error by the mean value of the measured
data. The result is called the normalized standard deviation of the error
(NSDE),

$$\mathrm{NSDE}_i = \frac{s_i}{\dfrac{1}{m} \sum_{k=1}^{m} X_{ik}} \qquad (4\text{-}8)$$

where $s_i$ is the square root of the error variance and $X_{ik}$ is the kth
measurement of the ith acoustic feature.

Table 4-2 depicts the NSDE value for each acoustic parameter of this experiment.

Table 4-2. NSDE for each acoustic parameter of Experiment 1.

  pitch contour   gain contour
  0.0603          0.3356

  glottal pulse (LF parameters)
  tp        te        tc        ta        Ee
  0.0259    0.0391    0.2187    0         0.1922

  formant frequency
  F1        F2        F3        F4        F5
  0.1171    0.1008    0.0368    0.1433    0.0592

From visual inspection, the waveform envelope (gain contour) and the pitch
contour of the converted speech were not close to those of the target speech.
This was expected, since a linear mapping function was used to convert the gain
contour and the pitch contour. However, the average gain and pitch period of the
converted speech were equal to those of the target speech. An informal listening
test confirmed that the converted speech sounded like a female voice and was
perceptually very similar to the target speaker.

4.3.3 Experiment 2

The test speech token of this experiment was the sentence "We were away a year
ago," spoken by a normal male speaker and a male speaker with a vocal disorder.
The objective was to convert the normal voice to sound like the pathological
voice.


As in Experiment 1, the LF model and LP model were selected as the source model
and the vocal tract model of our speech synthesizer, respectively. Five types of
acoustic features were measured for each utterance. The affine transformation
was used to convert the pitch contour, the gain contour, the glottal pulse and
the vocal tract response of the source speech to those of the target. Figure 4-9
illustrates the warping path between the two tokens. Figure 4-10 depicts the LMR
functions for the five acoustic features. The converted acoustic parameters are
shown in Figures 4-11 and 4-12. The speech waveforms and spectrograms of the
source speech, the converted speech and the target speech are shown in Figure
4-13 and Figure 4-14, respectively.

An informal listening test revealed that the converted speech sounded more like
a pathological voice than the normal voice. However, the quality of the
synthesized speech was judged inferior to the result of Experiment 1. Table 4-3
depicts the NSDE value for each acoustic parameter of this experiment. Compared
with Table 4-2, the NSDE values were larger in most categories, notably the
pitch contour and the gain contour, which may indicate larger distortion in
mapping parameters. This may be due to the fact that the test token was a
sentence, which contained more dynamic features.

Figure 4-9. The warping path between the source and target speech.

Figure 4-10. The LMR functions for each acoustic parameter.

Figure 4-11. The pitch contours and the gain contours of the source speech, the
converted speech and the target speech (source: dashed; converted: solid;
target: dotted).

Figure 4-12. The formant tracks of the source speech, the converted speech and
the target speech.

Figure 4-13. The signal waveforms of the source speech, the converted speech
and the target speech.

Figure 4-14. The spectrograms of the source speech, the converted speech and
the target speech.



Table 4-3. NSDE for each acoustic parameter of Experiment 2.

  pitch contour   gain contour
  0.2352          0.4998

  glottal pulse (LF parameters)
  tp        te        tc        ta        Ee
  0.0669    0.0751    0.1983    0         0.3476

  formant frequency
  F1        F2        F3        F4        F5
  0.1294    0.1427    0.0342    0.1163    0.0543

4.3.4  Experiment  3 

In order to validate our algorithms, we conducted about sixty other experiments
using approaches similar to those described above. Speech tokens for the same
sentence were spoken by five males and two females, including two pathological
voices. Two words and one sentence were used in this research: "you", "veal",
and "We were away a year ago." As illustrated in Table 4-1, four types of
conversion methods were used in the experiments. We also used different
excitation source models and vocal tract models for our synthesizer. An informal
listening test was performed to evaluate the performance of our algorithms.
Overall, the quality of the synthetic speech produced by our system was good. A
preliminary evaluation shows that:

1. the affine transformation was preferred over the translation transformation.

2. the affine transformation was more effective in converting words than
sentences.

3. our conversion algorithm was more robust in the male/male conversion than in
the male/female conversion.

4.4  Discussion 

A particularly troublesome factor in our conversion algorithm is the spectral
conversion, especially for the sentence. Proper measurement of the spectrum,
including the formant frequencies and bandwidths, is essential for synthesizing
high quality speech with a formant synthesizer. In fact, speech synthesized by
our formant synthesizer is inferior to that synthesized by our LP synthesizer
with otherwise identical parameter settings. This implies that the measured
formant track is not as accurate as we assumed. As a result, the mapping
function is dubious. To solve this problem, we might have to develop a more
robust formant estimation algorithm. We plan to investigate this problem
further.

The object of our experiments was to test the performance of our voice
conversion system. These experiments indicate that our voice conversion system
has the potential for converting the speech from one speaker to that of another.
We found that the affine transformation method was effective for converting two
voices with similar characteristics, e.g., male/male conversion and intonation
conversion.

4.4.1  Translation  Transformation  Method  vs.  Affine  Transformation  Method 

We have found the translation transformation to be less effective than the
affine transformation, especially for male/female conversion. The translation
has only one degree of freedom to adjust for speaker differences and is
therefore less able to adapt to local transitions. Furthermore, the translation
may produce an error when mapping two subjects that are very different in
nature. For instance, the converted pitch period could be negative if the
dynamic range of the source pitch contour is bigger than the translation scalar.
In fact, a pitch period of less than 1.3 ms is very difficult to synthesize and
may result in unwanted noise. As a result, the translation (bias) method is not
suitable for converting voices with a large dynamic range.
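
The failure mode described above can be reproduced with a small Python sketch that converts a pitch-period contour once by translation and once by an affine map. The contour values are invented for illustration. The translation offset is taken as the difference of the means (the least-squares choice when the scale is fixed at one), and the affine coefficients come from an ordinary least-squares fit; both are assumptions about how the coefficients are estimated.

```python
from statistics import fmean
from typing import List, Sequence

MIN_PITCH_PERIOD_MS = 1.3  # periods below this are hard to synthesize (see text)


def convert_by_translation(source: Sequence[float],
                           target: Sequence[float]) -> List[float]:
    """Shift every source value by the difference of the two means."""
    offset = fmean(target) - fmean(source)
    return [y + offset for y in source]


def convert_by_affine(source: Sequence[float],
                      target: Sequence[float]) -> List[float]:
    """Map source values with a least-squares scale and offset."""
    mean_y, mean_x = fmean(source), fmean(target)
    syy = sum((y - mean_y) ** 2 for y in source)
    sxy = sum((y - mean_y) * (x - mean_x) for y, x in zip(source, target))
    a = sxy / syy
    b = mean_x - a * mean_y
    return [a * y + b for y in source]


def too_short(periods_ms: Sequence[float],
              floor_ms: float = MIN_PITCH_PERIOD_MS) -> List[int]:
    """Indices of converted pitch periods below the synthesis floor."""
    return [k for k, p in enumerate(periods_ms) if p < floor_ms]


if __name__ == "__main__":
    source_ms = [12.0, 10.0, 8.0, 4.0]   # wide dynamic range (illustrative)
    target_ms = [4.5, 4.0, 3.5, 3.0]     # narrow dynamic range (illustrative)
    shifted = convert_by_translation(source_ms, target_ms)
    scaled = convert_by_affine(source_ms, target_ms)
    print(shifted, too_short(shifted))
    print(scaled, too_short(scaled))
```

In this example the translation drives the shortest source period below zero, while the affine map compresses the source range and keeps every converted period above the floor.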

4.4.2  Word  Conversion  vs.  Sentence  Conversion 

Our experiments indicate that our methods are more effective in converting words
than sentences. Perhaps this is because a sentence has more dynamic changes than
our linear mapping methods can capture. One way to solve this difficulty is to
segment the speech signal into phonemes and estimate piecewise mapping functions
instead of one transformation function for the whole utterance. This also
implies that our time-invariant assumption is not valid. From preliminary hand
segmentation experiments, we found that the mapping function varied from phoneme
to phoneme, especially for the formant track transformation. We hypothesize that
including a phoneme detector would improve the performance of our voice
conversion system.
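
The piecewise idea can be sketched in Python as one least-squares affine map per segment label. This is an illustration of the proposal, not an implemented part of the system: the segment labels stand in for the output of a phoneme detector or hand segmentation, and all names and data are hypothetical.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict, Hashable, List, Sequence, Tuple


def fit_affine(source: Sequence[float],
               target: Sequence[float]) -> Tuple[float, float]:
    """Least-squares (a, b) such that target is approximated by a*source + b."""
    mean_y, mean_x = fmean(source), fmean(target)
    syy = sum((y - mean_y) ** 2 for y in source)
    if syy == 0:
        raise ValueError("source values are constant within a segment")
    sxy = sum((y - mean_y) * (x - mean_x) for y, x in zip(source, target))
    a = sxy / syy
    return a, mean_x - a * mean_y


def fit_piecewise(source: Sequence[float], target: Sequence[float],
                  labels: Sequence[Hashable]
                  ) -> Dict[Hashable, Tuple[float, float]]:
    """Fit a separate affine map for the frames carrying each label."""
    grouped = defaultdict(lambda: ([], []))
    for y, x, label in zip(source, target, labels):
        grouped[label][0].append(y)
        grouped[label][1].append(x)
    return {label: fit_affine(ys, xs) for label, (ys, xs) in grouped.items()}


def convert_piecewise(source: Sequence[float], labels: Sequence[Hashable],
                      maps: Dict[Hashable, Tuple[float, float]]) -> List[float]:
    """Apply to each frame the map fitted for that frame's label."""
    return [maps[label][0] * y + maps[label][1]
            for y, label in zip(source, labels)]


if __name__ == "__main__":
    labels = ["a", "a", "a", "b", "b", "b"]      # hypothetical segment labels
    source = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
    target = [3.0, 5.0, 7.0, 4.5, 5.0, 5.5]      # a: 2y + 1, b: 0.5y + 4
    maps = fit_piecewise(source, target, labels)
    print(maps)
    print(convert_piecewise(source, labels, maps))
```

A single affine map fitted to all six frames could not reproduce both segments, whereas the per-segment maps recover each relationship separately.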

4.4.3  Gain  Contour  Transformation 

The gain contour is an important factor in characterizing voice, because it is
related to the loudness of the voice. We tested both transformation methods on
converting the gain contour. From visual inspection of the converted gain
envelope, both methods seemed to fail to capture the dynamic details of the gain
contour. Recall that one special constraint of our system is that the average
value of the gain contour does not vary from one utterance to another, since the
speech is normalized in amplitude during preprocessing. Thus, the major
difference between two gain contours lies in the difference of their envelopes,
which is nonlinear and difficult to quantify.

In an informal listening test, the gain contour converted by either method was
judged as very similar to that of the target speech. This finding contradicted
our expectation and may imply that the gain contour was not a major factor in
converting voice. We plan to investigate this factor further.

4.4.4 Glottal Source Transformation

In our normal/vocal disorder voice conversion experiments, we found the glottal
source was a primary cue for assessing the personality of a voice. The converted
speech was judged very similar to the target quality if we used the pitch
contour, the gain contour and the glottal pulse of the target, or those obtained
by the linear transformation method, while retaining the formant structures of
the source. This implies that the glottal excitation source greatly affected the
perceived quality and personality of the synthesized speech. A study conducted
by Childers and Ahn (1995) found a significant difference in four LF model
parameters among three voice types: modal voice, vocal fry and breathy voice.
Our experiments also support the previous findings of others that the pulse
length, the pulse width, the pulse skewness, the abruptness of closure and the
spectral tilt of the glottal pulse are essential to the quality of synthetic
speech (Rosenberg, 1971; Naik, 1984).

4.4.5  Male/Male  Conversion  vs.  Male/Female  Conversion 

The transformation methods were found to be more effective in male/male
conversion than in male/female conversion. This may be due to the fact that
female speech has a high fundamental frequency, which may interact with the
first formant. Generally speaking, the quality of synthetic female speech is
inferior to that of male speech. We also found that female speech has more
dynamic variations than male speech.

4.4.6  Factors  Responsible  for  Synthetic  Quality 

One objective of our experiments was to examine the synthetic quality of the
speech generated by the speech analysis-synthesis system. These experiments
convinced us that several analysis and synthesis factors were essential for
obtaining quality synthetic speech. These factors include:

1. Proper measurement of the spectrum, including the formant frequencies and
bandwidths. Speech synthesized by the formant synthesizer is inferior to that
synthesized by the LP synthesizer with the same excitation source model. In
addition, the continuity of formant tracks is strongly correlated with the
perceived quality, which agrees with earlier work by other researchers (Juang,
1984).

2. The quality of synthetic speech is sensitive to the voicetype classification.
This is because we use two types of excitation source models in the synthesizers
and they have different spectral and temporal characteristics.

3. The pitch contour influences the speech in two ways. First, the fundamental
pitch period is an important cue for assessing the speaker identity; however, a
5% shift in the fundamental pitch period will not alter the voice perception of
synthetic speech. Second, pitch fluctuation is very difficult to quantify and
mimic. It reflects a personal speaking style.

4. The gain transition between voiced and unvoiced segments is important for
producing noise-free synthetic speech. If the gain increased sharply from an
unvoiced region to a voiced region, the synthesized speech was judged as
inferior. A smoother gain contour is preferred by listeners.

5. The glottal excitation pulse should be as close to the original as possible,
especially if the target has a vocal disorder. Glottal waveform parameters
related to glottal timing events are perceptually relevant, e.g., the duration
of the open glottal interval, the duration of the closed glottal interval, the
duration of the opening of the glottis and the duration of the closing of the
glottis. In 1995, Shue established a voice conversion procedure for several
voice types, such as modal, vocal fry and breathy voices, based on mapping
tables between vocal quality and the glottal pulse timing parameters. Our
experimental results support his approach.

4.5  Summary 

This chapter describes our approach to converting voice from one speaker to that
of another speaker. This research is one model for studying factors responsible
for the quality of synthetic speech, for mimicking voices, and for speaker
normalization. The voice conversion algorithms are based on a speaker adaptation
model that treats speaker differences as arising from a parametric
transformation. The voice conversion task is realized as the mapping between two
sets of parameters.

There  are  four  types  of  conversion  methods  used  in  this  study.  The  affine 
transformation  method  proved  to  be  effective  for  voice  conversion  of  words,  but  less  so 
for  sentences.  One  possible  way  to  solve  this  difficulty  is  to  include  a  phoneme  detector 
in  our  system,  and  estimate  the  mapping  function  for  each  phoneme.  We  also  found  that 
replication  of  the  spectral  characteristics  of  a  speaker's  voice  is  essential  to  creating  high 
quality  synthetic  speech. 

The experiments demonstrated the flexibility and controllability of our voice
conversion algorithms. This research has direct application to speaker
normalization. If we are able to discover the rules by which one speaker can be
converted or transformed to sound like another, then presumably all speakers
could be converted to sound like one speaker, thus achieving speaker
normalization. Our algorithms provide a mechanism for learning these rules for
four types of acoustic parameters.


CHAPTER  5 
SOFTWARE  SYSTEM  FOR  VOICE  CONVERSION 


The purpose of this chapter is to introduce the software system used in this
study. This system has a graphical user interface (GUI) with features that help
the user to 1) select the analysis algorithms, 2) inspect and correct the
analysis parameters, 3) align the acoustic parameters of speakers with various
speaking rates using dynamic time warping, 4) modify the acoustic parameters,
and 5) execute the synthesis processes and display the synthesized speech.

The software system requires a set of acoustic features that are measured from
the original speech. The features are the voicetype classification, the pitch
contour, the formant frequencies and bandwidths, the shape of the glottal
waveform and the gain contour. As illustrated in Figure 5-1, the features are
extracted, modified and synthesized in the analysis phase, the modification
phase and the synthesis phase, respectively. A simplified diagram of the
operating procedures is shown in Figure 5-2.

This software program is called VOCOS (VOice COnversion System). All modules and
the graphical user interfaces are implemented in Matlab. Figure 5-3 illustrates
the main function window of VOCOS, which provides the functions to accomplish
the desired task.

5.1  Speech  Analysis 

The analysis program is based on fixed-frame linear prediction (LP) analysis.
The user can specify the analysis parameters to accommodate a variety of speech
samples. Figure 5-4 illustrates a group of GUI windows for analysis. The main
analysis window is shown in Figure 5-4(a). The topmost button, "Specification,"
is used to specify the analysis parameters in the specification window;
otherwise the program will use the default values shown in Figure 5-4(b). The
user can import the speech signal by clicking the "Load speech file" button, and
a popup window will show the available speech files in the subdirectory
/anadata/. If the "Visual correction" option is checked, a visual inspection
window will appear for the user to inspect and correct the analyzed results. The
analysis process starts when the user presses the "Execution" button. At the end
of the analysis, the user may save the analyzed results by clicking the "Save
analysis result" button and typing the filename in the editable-text panel,
which is shown in Figure 5-4(d). The last button, "Return to main menu," will
close all the analysis GUI windows and return the user to the main VOCOS
function window.

Figure 5-1. The block diagram of the voice conversion software system.
*Not available for all analysis parameter settings.

Figure 5-2. A simplified operation procedure diagram.

Figure 5-3. The main function window of VOCOS.

Figure 5-4. The analysis windows. (a) The main analysis window. (b) The
analysis parameter specification window. (c) The popup window for loading the
speech file. (d) The popup window for saving the analyzed result.

5.2  Analysis  Result  Inspection  and  Correction 

Sometimes, the automatic analysis procedure will produce errors that can result
in poorly synthesized speech. The group of GUI windows discussed here is
intended to alleviate the problem, as well as provide insight into the acoustic
features. The main visual inspection window is depicted in Figure 5-5. The three
topmost buttons supply the inspection functions for visualizing the analyzed
results and provide a means to correct errors. The fourth button, "Save
Corrected Data," lets the user save the corrected results in a file, as
specified in Figure 5-4(d). These functions can be invoked automatically in the
analysis phase if the "Visual correction" option is checked in the main analysis
window, or by simply pressing the "Correction" button in the main VOCOS function
window after the analysis phase. However, it is suggested that the user do the
visual inspection and correction in the analysis phase, so that the subsequent
analysis results are in accord with the previous results.

Figure 5-5. The visual correction window for selecting the type of acoustic
parameters to inspect and correct.

5.2.1 Voicetype Inspection

The voicetype classification is important for identifying the glottal closure
instants and other acoustic parameters. Figure 5-6 shows the inspection and
correction interfaces for this acoustic feature. The lower window displays the
entire speech signal waveform and the corresponding voicetype classifications.
The upper window displays a small portion of the signal and supplies the visual
function buttons to inspect and switch the voicetype classifications. The range
of the frames displayed in the upper window can be shifted to the left or right
by pressing the "<" or ">" button, respectively. The user can change the
voicetype of a frame by clicking the "Change voicetype" button (note that the
mouse cursor becomes a "+" sign), then moving the cursor to the frame and
clicking again. This procedure turns a voiced frame into an unvoiced frame, or
vice versa. Press the "Apply" button to accept the correction or the "Undo"
button to cancel the correction. The user can close these two windows by
clicking the "Return" button.

5.2.2  Glottal  Closure  Instant  Inspection 

The glottal closure instant (GCI) is essential for source modeling and speech
synthesis because pitch synchronous modeling and synthesis are used in this
system. The inspection and correction windows for this acoustic feature are
depicted in Figure 5-7. The lower window displays the corresponding pitch
contour to help the user locate the GCIs that are in error. The upper window
displays a small portion of the signal, and the selected GCIs are circled in
blue. The upper window also provides the interactive function buttons to inspect
and correct the location of the GCIs. The user can zoom in or zoom out of the
plot by selecting the "Zoom in" or "Zoom out" button. Each time the user clicks
on one of these buttons, the axes limits are changed by a factor of 2 (or the
value specified by the user). The display range can be shifted to the left or
right by pressing the "<" or ">" button, respectively. The shift range can be
altered by typing a new value in the editable panel (the third from the left).
By pressing the "Add one gci" or the "Delete one gci" button, the user can
insert or delete a GCI at the point which has the maximal residual value under
the mouse cursor.

Figure 5-6. The windows for inspecting the voicetype classifications. (a) The
inspection and correction window for the voicetype classification. (b) The
voicetype display window for the entire speech signal.
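
The GCI insertion and deletion described above can be sketched in Python as a search for the largest residual sample near the cursor. This is an illustration under stated assumptions rather than the VOCOS code: "maximal residual value" is read as the largest absolute value of the LP residual, "under the mouse cursor" is modeled as a window of `half_width` samples around the cursor position, and the function names are hypothetical.

```python
import bisect
from typing import List, Sequence


def peak_near_cursor(residual: Sequence[float], cursor: int,
                     half_width: int = 20) -> int:
    """Index of the largest |residual| within half_width samples of cursor."""
    lo = max(0, cursor - half_width)
    hi = min(len(residual), cursor + half_width + 1)
    if lo >= hi:
        raise ValueError("cursor is outside the residual signal")
    return max(range(lo, hi), key=lambda k: abs(residual[k]))


def add_gci(gcis: Sequence[int], residual: Sequence[float], cursor: int,
            half_width: int = 20) -> List[int]:
    """Return a sorted GCI list with the residual peak near cursor inserted."""
    peak = peak_near_cursor(residual, cursor, half_width)
    updated = sorted(gcis)
    if peak not in updated:
        bisect.insort(updated, peak)
    return updated


def delete_gci(gcis: Sequence[int], residual: Sequence[float], cursor: int,
               half_width: int = 20) -> List[int]:
    """Return the GCI list without the strongest existing GCI near cursor."""
    nearby = [g for g in gcis if abs(g - cursor) <= half_width]
    if not nearby:
        return list(gcis)  # nothing under the cursor, so nothing to delete
    victim = max(nearby, key=lambda g: abs(residual[g]))
    return [g for g in gcis if g != victim]


if __name__ == "__main__":
    residual = [0.0] * 100
    residual[30], residual[66], residual[70] = -1.0, 0.3, -0.9
    gcis = add_gci([30], residual, cursor=68, half_width=5)
    print(gcis)
    print(delete_gci(gcis, residual, cursor=31, half_width=5))
```

Snapping to the residual peak means the user only has to click near a closure instant rather than exactly on it.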

5.2.3  Formant  Track  Inspection 

The  formant  track  can  be  inspected  by  selecting  the  formant  configuration  as  the 
vocal  tract  model  in  the  analysis  phase.  Figure  5-8  shows  the  inspection  and  correction 
interfaces for this function. The lower window displays the formant track of the entire
speech signal for reference. Note that the colored circles represent the measured formant
frequencies  for  each  frame.  The  upper  window  displays  a  small  portion  of  the  track  and 
supplies  the  visual  function  buttons  to  inspect  and  correct  the  formant  frequencies.  An 
errant  formant  frequency  is  corrected  as  follows.  Click  the  "Select  the  formant  to  correct" 
button  (note  that  the  mouse  cursor  becomes  a  "+"  sign)  and  move  the  cursor  to  that 
formant,  then  press  the  left  mouse  key.  The  selected  formant  will  be  denoted  by  a  cross  in 
red.  The  user  can  alter  the  formant  frequency  by  clicking  the  "+"  button  to  increase  the 
frequency  or  the  "-"  button  to  decrease  the  frequency.  As  before,  the  display  range  can  be 
shifted  to  the  left  or  right  by  pressing  the  "<"  or  ">"  button,  respectively.  Press  the 
"Apply  the  correction"  button  to  accept  the  correction  or  the  "Undo"  to  cancel  the 
correction.  When  finished,  the  user  closes  these  two  windows  by  clicking  the  "Return" 
button. 

5.3  Parameter  Modification 

The main modification window, as illustrated in Figure 5-9, is invoked by the
"Modification" button in the main function window. Besides two file buttons,
"load" and "save," the window has five modification function buttons and a
"Return" button. The usage of the modification functions and the associated GUI
windows is described below.

Figure 5-7. The windows for inspecting GCIs. (a) The visual inspection and
correction window for GCIs. (b) The display window of the corresponding pitch
contour to assist the visual inspection of GCIs.

Figure 5-8. The windows for inspecting the formant track. (a) The inspection
and correction window for the formant track. (b) The display window for the
formant track.

Figure 5-9. The window for modifying the acoustic parameters.

5.3.1 Voicetype Classification

Figure 5-10 shows the GUI window for modifying the voicetype classifications.
The lower portion of the window displays the voicetype classifications for the entire
speech signal. The blue line represents the original voicetype, and the red "+" represents
the new voicetype set by the user. The upper portion of the window supplies the function
buttons to switch the voicetype classifications. The number and voicetype of the selected
frame are shown in the third and fourth panels from the left, respectively. The selected
frame can be changed by pressing the "<" or ">" button, or by typing the frame number in
the third panel. The selected frame is circled in green in the plot. The voicetype of the
frame can be switched by clicking the "Change voicetype" button, which turns a voiced
frame into an unvoiced one, or vice versa. Press the "Apply" button to accept the
correction or the "Undo" button to cancel it. The user may close this GUI window by
clicking the "Return" button.

5.3.2  Pitch  Contour 

Figure 5-11 shows three GUI windows for modifying the pitch contour: the left
window supplies the function buttons to alter the factors; the upper window plots the pitch
wave in black and the pitch jitter in red; and the lower window displays the fundamental
pitch period in blue and the pitch contour in black. As stated in Chapter 3, the pitch
contour consists of three factors that can be altered separately. The fundamental pitch
period is controlled by the topmost slider: click the left button of the slider to decrease
the value, or the right button to increase it. The jitter value is controlled by the lower
slider and can be changed in a similar manner. The pitch wave is modified by the
procedure described below.

Figure 5-10. Window for modifying the voicetype classifications.

Figure 5-11. Windows for modifying the pitch contour.

(a) The function window. (b) The window for displaying
the pitch wave and jitter. (c) The window for displaying
the fundamental pitch period and the pitch contour.

In the middle of the left window there are two menus and one button for modeling
the pitch wave. The "label wave / delabel wave" popup menu is used to mark or delete a
critical point on the pitch wave as a knob that controls the shape of the modeled wave.
Generally speaking, the more knobs that are created on the waveform, the better the fit.
The "poly-fit / spline-fit" popup menu supplies two types of polynomial functions to model
each segment of the pitch wave delimited by knobs: "poly-fit" finds the optimal
second-order polynomial for each segment of the waveform in a mean-square-error
sense, while "spline-fit" finds a series of optimal cubic polynomials for the entire pitch
wave in a piecewise smoothness sense. Consequently, the wave modeled by second-order
polynomials may not be continuously differentiable at the knobs. The modeled wave is
drawn in green and the knobs are marked by green crosses. The "draw wave" button
enables a knob to be moved in the vertical direction to a design point, not necessarily
located on the pitch wave, so that the waveform is modified. For example, press the
"label wave" button and wait for the cursor to become a "+" sign. Move the cursor to the
destination point on the plot and press the left mouse button again. This procedure moves
the knob closest to the cursor to the new position, and a new modeled wave is then
generated (this process may take a few seconds to complete). Press the "Apply" button to
construct the new pitch contour. To cancel the modification and reload the previous pitch
contour and its factors, click the "Undo" button. Press the "Return" button to close these
three windows and return to the main modification function window.
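
The effect of the number of knobs on the "poly-fit" option can be illustrated with a short sketch. The Python code below is not taken from VOCOS; it assumes NumPy, a toy sinusoidal pitch wave, and a hypothetical helper `piecewise_quadratic_fit` that fits an independent least-squares second-order polynomial to each segment between knobs.

```python
import numpy as np

def piecewise_quadratic_fit(t, wave, knot_indices):
    """Fit an independent least-squares quadratic to each segment between knobs."""
    t = np.asarray(t, dtype=float)
    wave = np.asarray(wave, dtype=float)
    # Segment boundaries: first sample, user-marked knobs, last sample.
    bounds = sorted(set([0, *knot_indices, len(t) - 1]))
    model = np.empty_like(wave)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        seg = slice(lo, hi + 1)          # adjacent segments share the knob sample
        order = min(2, hi - lo)          # very short segments cannot support a quadratic
        coeffs = np.polyfit(t[seg], wave[seg], order)
        model[seg] = np.polyval(coeffs, t[seg])
    return model

t = np.arange(40)
pitch_wave = np.sin(2 * np.pi * t / 40)  # toy pitch wave, arbitrary units

few_knots = piecewise_quadratic_fit(t, pitch_wave, [20])
many_knots = piecewise_quadratic_fit(t, pitch_wave, [8, 16, 24, 32])

err_few = np.mean((pitch_wave - few_knots) ** 2)
err_many = np.mean((pitch_wave - many_knots) ** 2)
print(err_few, err_many)
```

Because each segment is fitted on its own, the slope of the modeled wave can jump at a knob, which is the behavior the "spline-fit" option is meant to avoid.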

5.3.3  Gain  Contour 

Figure 5-12 shows the GUI windows for modifying the gain contour. The lower
window depicts the gain contour. The left window supplies the function buttons to
modify the two factors of the gain contour, the gain envelope and the gain perturbation,
which are displayed in the upper window in black and red, respectively. The gain
envelope and perturbation are altered separately as follows. The perturbation is controlled
by a slider. Press the left side of the slider to decrease the value, or the right side to
increase it. The procedures for modifying the gain envelope are similar to those for
modifying the pitch wave as discussed above. To cancel the modification and reload
the previous gain contour and its factors, click the "Undo" button. Press the "Return"
button to close these three windows and return to the main modification function window.

Figure 5-12. Windows for modifying the gain contour.

(a) The function window. (b) The window for displaying
the gain envelope and perturbation. (c) The window for
displaying the gain contour.

5.3.4  Glottal  Waveform 

This type of modification is available for LF source modeling. Figure 5-13 shows
the GUI windows for modifying the differentiated glottal flow. The upper window
provides the function buttons to alter the LF timing parameters and select the frame of
interest. The lower window illustrates the corresponding quasi-periodic waveform and
spectrum. The LF timing parameters as well as the displayed frame can be changed by
pressing the respective sliders. However, the user should be careful when setting the LF
timing parameters. An error message is issued if the setting is invalid, for example if Tp
is larger than Te.

5.3.5  Spectrogram 

This modification function is available when the formant configuration is used as the
vocal tract model. Figure 5-14 illustrates the graphical user interfaces used to modify the
spectrogram of the synthesized speech. The left window supplies the control buttons to
shift the formant track up and down. The uppermost popup menu in the left window
enables the user to select the formant to be modified. To alter the value, the user can click
the slider or type a new value in the editable-text panel (the third from the top). The
resulting formant track is overlaid on the original track in the formant track display
window. The user may accept the changes by clicking the "Apply" button or reload the
original tracks by pressing the "Undo" button.

5.3.6  Voice  Conversion 

Figure 5-15 shows the GUI windows for voice conversion. The first row of three
buttons allows users to set the local constraints, the distortion measures and the display of
the search for the optimal DTW path, respectively. The second row of buttons supplies
four conversion methods for four types of acoustic features. Users can select different
combinations of conversion methods to simulate the voice conversion process. Two
buttons in the third row enable users to load the target speech file and the source speech
file, respectively. Note that those files must contain the five types of acoustic features of
the speech signal and be saved in Matlab format. Press the "Apply" button to start the
voice conversion process or "Cancel" to reset the parameter settings. After the voice
conversion process is completed, the message "you may run the synthesizer" is displayed
in the Matlab command window. Users can close this GUI window by clicking the
"Return" button and go to the speech synthesis phase to generate the converted speech.

Figure 5-13. Windows for modifying the differentiated glottal flow.

(a) The function window. (b) The window for displaying
the quasi-periodic waveform and spectrum.

Figure 5-14. Windows for modifying the spectrogram. (a) The function
window. (b) The window for displaying the spectrogram.

5.4 Speech Synthesis

The speech synthesis GUI is invoked by clicking the "Synthesis" button on the
main function window. Figure 5-16 shows the GUI windows. The user can load the
analyzed data file by pressing the "Load analyzed file" button; otherwise the synthesizer
uses the data held in temporary memory by Matlab. Press the "Synthesize" button to
start the synthesis process using either an LP synthesizer or a formant synthesizer. The
synthesized speech may be saved by pressing the "Save synthesized speech" button and
specifying the filename in the editable-text panel. When the synthesis is complete, a
popup window appears in the center of the screen to display the synthesized results, as
shown in Figure 5-16(d). Clicking the "play" button plays the speech signal through the
Matlab default speakers. The user can view the spectrogram of the speech signal by
pressing the "spectrogram" button; the result is depicted in Figure 5-16(e).


5.5 Summary


A voice conversion software program called VOCOS has been implemented. The
major features of the system are 1) the parameterization of acoustic features, 2) parameter
visualization, and 3) user-friendly interaction for modifying parameters. All modules
and the graphical user interfaces are implemented in Matlab.

Figure 5-15. GUI windows for voice conversion. (a) The function
window. (b) The window for displaying the process
of searching the optimal path.

Figure 5-16. The synthesis windows. (a) The main synthesis window.
(b) The popup window for loading the analyzed file.
(c) The popup window for saving the synthesized result.
(d) The window for displaying the speech waveforms.
(e) The window for displaying the speech spectra.

There are four phases for operating the software system: the analysis, the
correction, the modification and the synthesis phase. In the analysis phase, the acoustic
features are extracted from the original speech according to the specifications. Each
feature is represented by one set of parameters, which are displayed in graphs. The user
can inspect and correct the analyzed results through the various interfaces in the
correction phase. To create a new type of voice, the features are modified in the
modification phase by altering the parameter values through a group of graphical user
interfaces. Each set of parameters can be modified independently. Finally, the synthesis
phase constructs the new speech signal, and the resulting speech waveform and
spectrogram are displayed in graphs.


CHAPTER  6 
CONCLUSIONS  AND  FUTURE  WORK 


The  primary  goal  of  this  research  was  to  develop  a  software-based  voice 
conversion  system  to  modify  the  characteristics  of  human  voice.  The  system  was 
intended  to  generate  high  quality  speech  tokens  for  speech  science  and  psychoacoustic 
studies.  The  results  of  this  study  will  be  of  interest  to  researchers  in  speech  analysis, 
speech  synthesis,  speaker  identification  and  speech  recognition. 

The key ideas for our system are based on the source-tract speech production
model, which is a highly parametric representation for speech analysis and synthesis. In
our system the acoustic features of speech are described by speech parameters, and these
parameters can be transformed to mimic another speaker's voice, or modified to create a
new voice in the acoustic feature space.

The voice conversion algorithms are based on a speaker adaptation model
that treats speaker differences as arising from a parametric transformation. The voice
conversion task is then realized as the mapping between two sets of parameters. We found
that the affine transformation is effective for converting single-syllable words, but less so
for sentences.

6.1  Summary  of  Results 
6.1.1  Speech  Analysis  and  Synthesis 

The speech analysis and synthesis system used in this research is a realization of
the source-tract speech production model. The system consists of three subsystems, a
speech analyzer, a parameter modifier and a speech synthesizer, which extract, modify
and synthesize five types of acoustic features, respectively. The features are the formant
frequency and bandwidth, the shape of the glottal pulse, the voicetype classification, the
pitch contour and the gain contour. The first two types of parameters are frame-based,
and they represent the speaker's characteristics of the vocal tract and the glottal folds,
respectively. The final three parameters form the controlling parameters for our system.
One major feature of our acoustic model is that the controlling parameters are
independent of the other parameters, so that they control how the frame-based
information is concatenated, for example by changing the speaking rate or increasing the
voice volume. This makes it possible to mimic the characteristics of another speaker's
voice, including the prosodic features.

The hybrid configuration is another important feature of our system. There are
two types of waveform models for the voiced excitation: 1) the 6th order polynomial model
and 2) the transformed LF model. Both are capable of producing realistic glottal
waveforms, and their parameters are easy to estimate. The synthesis system can synthesize
speech almost perfectly when the estimated glottal waveform from the glottal inverse
filtering process is used as the excitation. When the glottal waveform modeled by the 6th
order polynomial is used as the excitation source, the synthesized speech is natural and
intelligible.

There are also two types of schemes to represent the vocal tract: 1) the linear
prediction filter and 2) the formant filter. The LP scheme is able to reproduce the spectrum
of all speech sounds but lacks a physically meaningful interpretation for simulating human
speech production. For the formant scheme, on the other hand, it is difficult to estimate the
resonant tract parameters (formant frequencies and bandwidths). The LP scheme offers a
better quality of synthesized speech for both voiced and unvoiced speech when no
modification is applied. The formant scheme produces better synthesized speech if
the formant track is modified.

In sum, there are four different combinations to analyze and synthesize speech.
However, we are more interested in studying the combination of the transformed LF
source model and the formant tract configuration, since the parameters of both models are
correlated with the physiology of speech production. The other three combinations build
a comparison base to study the effectiveness of our analysis and synthesis algorithms.

6.1.2  Voice  Modification 

In our design, the speech signal is described by three sets of measurements of
acoustic features: the source parameters, the resonant parameters and the controlling
parameters. Each set of parameters is independent of the others. Although the
acoustic features are described by parameters, it is not straightforward to alter the
analyzed parameters so that the synthesized speech will have the desired effect. The
modified parameters have to be consistent with the speech production model in order to
produce high quality speech.

We proposed a pitch contour model to describe and control the pitch contour, which
records the value of the pitch period along the time axis. The proposed model consists of
three factors: the fundamental pitch period, the pitch wave and the pitch jitter. Each pitch
factor has its own perceptual importance and is independent of the others. This model
makes it possible to create or mimic various types of voices, for example through
voicetype conversion or intonation pattern conversion.
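
The idea of altering the three pitch factors separately can be sketched as follows. This is only an illustration under loud assumptions, not the model defined in Chapter 3: it assumes the factors combine additively, takes the fundamental pitch period as the mean period, the pitch wave as a moving-average trend, and the jitter as the remaining residual. The helper name `decompose_pitch_contour` and the toy contour are hypothetical.

```python
import numpy as np

def decompose_pitch_contour(periods, smooth_len=5):
    """Split a pitch-period contour into (fundamental, wave, jitter), assumed additive."""
    periods = np.asarray(periods, dtype=float)
    fundamental = periods.mean()                 # constant baseline period
    deviation = periods - fundamental
    # Moving average (odd length) keeps the slow trend as the "pitch wave".
    kernel = np.ones(smooth_len) / smooth_len
    padded = np.pad(deviation, smooth_len // 2, mode="edge")
    wave = np.convolve(padded, kernel, mode="valid")
    jitter = deviation - wave                    # fast period-to-period residual
    return fundamental, wave, jitter

# Toy contour in milliseconds: slow arch plus small random perturbation.
rng = np.random.default_rng(0)
periods = 8.0 + 0.5 * np.sin(np.linspace(0, np.pi, 50)) + 0.02 * rng.standard_normal(50)
fundamental, wave, jitter = decompose_pitch_contour(periods)

# Alter each factor separately, then rebuild the contour:
# shorter fundamental period (higher pitch), same intonation wave, half the jitter.
new_periods = 0.8 * fundamental + wave + 0.5 * jitter
```

Under these assumptions the three factors sum back to the original contour exactly, so any one of them can be changed while the other two are held fixed.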

We also developed a similar model to modify the gain contour, which records the
excitation gain for each pitch period. In that model, the contour consists of the gain
envelope and the gain perturbation. Both gain factors have their own perceptual
importance and can be altered independently.

The formant track can be modified by a scale factor, by mouse-drawing or by copying
from other tracks, and the new filter coefficients can be obtained by reversing the process.
However, the direct construction of the resonant filter from the formant poles sometimes
results in a spectrum that deviates from the designed one. To relieve this problem, we
developed a pole compensation algorithm that modifies the formant bandwidths such that
the corresponding formant energy is kept at the designed level. The formant spectrum
modified with our algorithm is closer to the designed spectrum than that obtained without
it.

6.1.3  Voice  Conversion 

Our voice conversion algorithms are based on a speaker adaptation model that
treats speaker differences as arising from a parametric transformation, so that the acoustic
features of one speech token (the source) match those of the desired one (the target). The
voice conversion task is then realized as the mapping between two sets of parameters.
Four types of mapping methods are used in our studies: the translation transformation,
the affine transformation, the copy method and the retain method. The coefficients of the
mapping function are determined by the linear multivariable regression algorithm. We are
more interested in studying the effectiveness of the first two methods in converting voice,
while the other two methods build a comparison base to study the effects of the acoustic
features on the personality of the synthesized speech.
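
As an illustration of how mapping coefficients can be obtained by regression, the sketch below fits a translation and an affine map between time-aligned source and target parameter vectors by ordinary least squares. It is an assumption-laden example rather than the VOCOS code: the function names `fit_translation_map` and `fit_affine_map`, the three-dimensional feature vectors and the synthetic data are all hypothetical.

```python
import numpy as np

def fit_translation_map(source, target):
    """Least-squares offset b for the model target ~ source + b."""
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    return np.mean(target - source, axis=0)

def fit_affine_map(source, target):
    """Least-squares A, b for the model target ~ source @ A.T + b (rows are frames)."""
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    # Append a column of ones so the offset b is estimated with A.
    design = np.hstack([source, np.ones((len(source), 1))])
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    A = coeffs[:-1].T
    b = coeffs[-1]
    return A, b

# Synthetic, already time-aligned frames (for example three formant frequencies).
rng = np.random.default_rng(1)
source = rng.normal(size=(200, 3))
true_A = np.array([[1.1, 0.0, 0.0],
                   [0.1, 0.9, 0.0],
                   [0.0, 0.0, 1.2]])
true_b = np.array([50.0, -20.0, 10.0])
target = source @ true_A.T + true_b

A, b = fit_affine_map(source, target)
converted = source @ A.T + b          # source frames mapped toward the target
offset = fit_translation_map(source, target)
```

A single pair (A, b) is estimated for the whole token here, which corresponds to the time-invariant mapping assumption discussed in this chapter.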

Several experiments were conducted to test the performance of our voice
conversion algorithms. The affine transformation method proved to be effective for
converting single-syllable words, but less so for sentences. Perhaps this is because a
sentence has more local dynamic changes than our linear mapping methods can capture.
One possible improvement is to include a phoneme detector in our system and to
estimate piecewise mapping functions instead of one linear function for the entire
utterance. We also found that replication of the spectral characteristics of the speaker's
voice is essential to create high quality synthetic speech.

The experiments demonstrated the flexibility and controllability of our voice
conversion algorithms. This research has direct application to speaker normalization. If
we are able to discover the rules by which one speaker's voice can be converted or
transformed to sound like that of another, then presumably all speakers could be converted
to sound like one speaker, thus achieving speaker normalization. Our algorithms provide a
mechanism for learning these rules for four types of acoustic parameters.

The conversion procedure also provides a systematic method for examining the
relationship between synthesized voice characteristics and the acoustic parameters. It also
provides the capability to establish a database for different voice types, which can be
used in training a speech recognition system (Childers and Ann, 1994).

6.2  Future  Work 

1.  Improvement  for  our  voice  conversion  system 

Our experiments indicate that our methods are more effective in converting words
than sentences. Perhaps this is because a sentence has more dynamic changes than our
linear mapping methods can capture. One way to solve this difficulty is to segment
the speech signal into phonemes and estimate piecewise mapping functions instead of one
transformation function for the whole utterance. This also implies that our time-invariant
assumption is not valid. We hypothesize that including a phoneme detector would
improve the performance of our voice conversion system.

2. ANOVA analysis of the independence assumption

We hypothesize that the five acoustic features extracted from speech are linearly
independent of one another and that each feature can be modified separately. However, we
would like to conduct a series of tests on this hypothesis using our flexible voice
conversion system. The listening test scores will be analyzed by the standard analysis of
variance (ANOVA).

3. Speech perception in voice conversion

In the voice conversion process, an objective criterion that functions as an alternative
to speech perception should be studied in order to accelerate the conversion process and
narrow the range of parameters for each voice type (Kitawiki and Nagabuchi, 1988).
In addition, a formal listening test along with a statistical analysis is suggested to
investigate the effectiveness of the voice conversion rule.

There is still a need for additional systematic investigations to discover the
relationship between the basic acoustic features of speech thought to reflect personality
and the speaker's perceived personality. Moreover, our system can be improved with
better formant estimation, which we think is an important factor for the quality of
synthesized speech, and our flexible analysis-synthesis system can be extended to include
this improvement without much effort.

We also would like to conduct more experiments on more subjects to gain insight
into the relationship between the basic acoustic features and the speaker's perceived
personality. Since the features of the glottal source, the vocal tract response, the gain
contour and the pitch contour are all involved in speech studies such as speaker
identification and speech recognition, this speech analysis/modification/synthesis system
can serve as a tool for future applications. Furthermore, this system can also
serve as a tool for psychoacoustic or linguistic studies because of the easy manipulation of
the pitch contour as well as the gain contour.

4.  Model  for  the  unvoiced  source 

We have demonstrated that unvoiced sounds are very sensitive to gain
modification. Inadequate modeling of the unvoiced excitation might be the reason.
Although we adopted the LP scheme to describe the vocal tract for unvoiced sounds, the
transformation of unvoiced sounds is not complete in our system. Adaptive modeling
of the unvoiced excitation might be a good starting point for overcoming this problem.


APPENDIX 

DYNAMIC  TIME  WARPING  ALGORITHM  IMPLEMENTATION 


A.1 Introduction to Dynamic Time Warping

This appendix describes the implementation of dynamic time warping algorithms
in the training process. In Chapter 2, we discuss the procedures for collecting the
synthetic parameters, which are defined essentially on a short-time basis. That is, each
short-time speech segment is represented by three sets of measurements of acoustic
features: the source parameters, the resonant parameters and the intonation. When
these parameters of one speaker are compared to those of another speaker, the durations
of phonemes usually differ between speakers, which complicates the comparison.
Thus there is a need to align the two speech tokens on the same time scale in order for the
utterance transformation to take place. The problem of time alignment can be formulated
as a path finding problem, as illustrated below.

Consider two speech tokens X and Y, represented by the sequences $(x_1, x_2, \ldots, x_N)$
and $(y_1, y_2, \ldots, y_M)$, respectively, where $x_i$ and $y_i$ are parameter vectors of the
short-time acoustic features. Define $i_x$ and $i_y$ as the time indices of X and Y, where
$i_x = 1, 2, \ldots, N$ and $i_y = 1, 2, \ldots, M$. Note that the durations, N and M, need not be
identical. The distance between X and Y is defined by collecting some function of the
short-time distances $d(x_{i_x}, y_{i_y})$, which will be denoted for simplicity of notation as
$d(i_x, i_y)$ without ambiguity. The total distance between X and Y is defined simply as

$$D(X, Y) = \sum_{i_x = 1}^{N} d(i_x, i_y) \qquad \text{(A-1)}$$

where

$$i_y = \phi(i_x) \qquad \text{(A-2)}$$

Our goal is to find the function $\phi$, which represents the mapping path between
speech tokens X and Y, that minimizes the total distance. Generally speaking, the mapping
function is nonlinear and depends on the contents. As shown in Figure A-1, the
minimization process can be thought of as a path finding problem over a finite grid.
Dynamic time warping (DTW) is one of the widely used solutions for this type of
problem (Sakoe and Chiba, 1971; Myers and Rabiner, 1980; Rabiner and Juang, 1993).

In order to find the optimal path in the $(i_x, i_y)$ plane that minimizes the total distance
function D, several factors of the DTW algorithm must be specified, including:

1. endpoint constraints on the path,

2. local continuity constraints, i.e., the possible slopes of the path,

3. global path constraints, i.e., the limitations on where the path can fall in the
$(i_x, i_y)$ plane,

4. axis orientation, i.e., the effects of interchanging the roles of the test and
reference patterns,

5. distance measures, i.e., both the local distance measure and the overall distance
metric used to determine the optimal path.

In the remainder of this section we discuss each of these factors and show how
they affect the implementation of the DTW algorithm.

Figure A-1. An example illustrating the grid for warping $Y(i_y)$ to $X(i_x)$.

A.2 Our Constraints on Finding Optimal Path

A.2.1  Endpoint  Constraints 

For the case of voice conversion with precisely determined endpoints for both the
reference and test tokens (as assumed here), the parametric path endpoint constraints are
of the form

$$i_y = \phi(i_x = 1) = 1 \quad \text{(beginning point)} \qquad \text{(A-3)}$$

$$i_y = \phi(i_x = N) = M \quad \text{(ending point)} \qquad \text{(A-4)}$$

In other words, both tokens start and end at the same points. This factor is fixed in our
implementation.

A.2.2  Local  Continuity  Constraints 

To  specify  the  reasonable  path,  some  local  constraints  must  be  applied  in  order  to 
make  sure  that  excessive  compression  or  expansion  of  the  time  scales  is  avoided.  A 
primary  constraint  of  this  type  is  the  monotonicity  constraint,  namely, 

(l)(ix,)  >  (l)(ix,)  if  ix.  >  ix,  (A-5) 

Table  A-1  shows  a  summary  of  the  types  of  local  constraints  that  have  been  used  in  our 
software  system. 

A.2.3  Global  Path  Constraints 

Because  of  the  local  path  constraints,  certain  parts  of  the  (ix,  iy)  plane  are  excluded 
from  the  region  in  which  the  optimal  warping  path  can  lie.  This  constraint  is  very 
important  to  reduce  the  total  cost  of  computation.  Suppose  the  maximum  slope  and 
minimum  slope  of  the  selected  local  constraint  are  Smax  and  Snun,  respectively.  The 
allowable  region  is  defined  by  the  four  straight  lines,  two  lines  starting  fi-om  the  beginning 
Table A-1. Summary of sets of local constraints and the resulting path
specifications of our software system.

type         allowable path specification
I            P1 -> (1,0);  P2 -> (1,1);  P3 -> (0,1)
II           P1 -> (1,1)(1,0);  P2 -> (1,1);  P3 -> (1,1)(0,1)
III          P1 -> (2,1);  P2 -> (1,1);  P3 -> (1,2)
IV           P1 -> (1,1)(1,0);  P2 -> (1,2)(1,0);  P3 -> (1,1);  P4 -> (1,2)
Itakura      P1 -> (1,0);  P2 -> (1,1);  P3 -> (1,2)
Itakura I    P1 -> (1,0);  P2 -> (1,1);  P3 -> (1,2);  P4 -> (1,3)

point and the other two lines from the ending point, as shown in Figure A-2. We included
this concept when implementing the algorithms.
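
The allowable region can be sketched in a few lines of Python. The example below is illustrative only and assumes finite slopes (the defaults Smax = 2 and Smin = 1/2 correspond to the type III constraint); the function names are hypothetical and do not come from our software.

```python
def in_allowable_region(ix, iy, n_frames, m_frames, s_max=2.0, s_min=0.5):
    """Return True if the 1-based grid point (ix, iy) lies inside the region
    bounded by the four lines through (1, 1) and (N, M)."""
    # Two lines starting from the beginning point (1, 1).
    lower_from_start = 1 + s_min * (ix - 1)
    upper_from_start = 1 + s_max * (ix - 1)
    # Two lines arriving at the ending point (N, M).
    lower_from_end = m_frames - s_max * (n_frames - ix)
    upper_from_end = m_frames - s_min * (n_frames - ix)
    lower = max(lower_from_start, lower_from_end)
    upper = min(upper_from_start, upper_from_end)
    return lower <= iy <= upper


def allowable_points(n_frames, m_frames, s_max=2.0, s_min=0.5):
    """List every grid point at which the recursion needs to be evaluated."""
    return [
        (ix, iy)
        for ix in range(1, n_frames + 1)
        for iy in range(1, m_frames + 1)
        if in_allowable_region(ix, iy, n_frames, m_frames, s_max, s_min)
    ]
```

For a 5 by 5 grid with these default slopes, only 9 of the 25 grid points remain, which indicates where the saving in computation comes from.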

A.2.4 Axis Orientation

Usually the optimal path depends on the axis orientation, that is,
φ(ix) ≠ φ(iy). This is a very important factor when selecting the target signal and the
source signal using our software. Only when both the local constraints and the
distance metric are symmetric does the assignment of the x-axis (test) and the y-axis
(reference) make no difference.

A.2.5 Distance Measure

Measuring the difference between two speech patterns in terms of average or
accumulated spectral distortion is a reasonable way of comparing patterns. A number
of distance measures have been proposed and used in various applications to
distinguish speech patterns (Rabiner and Juang, 1993). Table A-2 lists the spectral
distortion measures used in our software.
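
As a minimal sketch of one such measure, the truncated cepstral distance (the sum of squared differences of the first L cepstral coefficients) can be computed as follows in Python. The function names, and the assumption that each frame is supplied as a list of coefficients c_1, c_2, ... with c_0 excluded, are illustrative choices and not a description of our software.

```python
def cepstral_distance(c_ref, c_test, num_coeffs=None):
    """Truncated cepstral distance: sum over n = 1..L of (c_n - c'_n)^2.

    c_ref and c_test hold the cepstral coefficients of one frame each.
    If num_coeffs (L) is omitted, the shorter of the two lengths is used.
    """
    if num_coeffs is None:
        num_coeffs = min(len(c_ref), len(c_test))
    pairs = zip(c_ref[:num_coeffs], c_test[:num_coeffs])
    return sum((a - b) ** 2 for a, b in pairs)


def local_distance_matrix(x_frames, y_frames):
    """Return d[ix][iy] for every pair of frames (0-based list indices)."""
    return [[cepstral_distance(x, y) for y in y_frames] for x in x_frames]
```

The resulting matrix holds the local distances d(ix, iy) on which the warping path is evaluated.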

A.3  Implementation 

In order to implement this algorithm in a dynamic programming manner, two
additional principles are used:

1. a globally optimal path is also locally optimal;

2. the optimal path to the grid point (ix, iy) depends only on its preceding points
(ix', iy') where ix' ≤ ix, iy' ≤ iy.

These two principles define the standard dynamic programming recursive
relationship. Using them along with the global and local constraints, a partial
accumulated distance function D_A(n, m), representing the accumulated distance along the
Table A-2. Summary of spectral distortion measures used in our system.

distortion measure             mathematical expression
log spectral distance          \int_{-\pi}^{\pi} | \log [ S(\omega) / S'(\omega) ] | d\omega / (2\pi)
cepstral distance              \sum_{n=1}^{L} (c_n - c'_n)^2
Itakura distortion             \log ( a^T R_p a / \sigma_p^2 )
likelihood ratio distortion    a^T R_p a / \sigma_p^2 - 1

best path from (1,1) to (n, m), can be written as

D_A(n, m) = \min_{\phi} \sum_{i_x = 1, i_y = 1}^{i_x = n, i_y = m} d(i_x, \phi(i_x))    (A-6)

The function D_A(n, m) depends only on the paths from (1,1) to (n, m) and can be computed
recursively for n ≤ N, m ≤ M. In short, the implementation of the DTW algorithm
is a three-step procedure.

1. Initialization: Set D_A(1, 1) = d(1, 1).

2. Recursion: Compute D_A(n, m) recursively for 1 ≤ n ≤ N, 1 ≤ m ≤ M.

3. Termination: Set D(X, Y) = D_A(N, M).
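
The three steps above can be sketched in Python as follows. This is an illustrative simplification and not our implementation: it assumes the type I local constraint of Table A-1, applies no slope weighting or global path constraint, and takes the local distances as a precomputed list of lists. The name dtw_distance is hypothetical.

```python
import math


def dtw_distance(d):
    """Return the accumulated distance D_A(N, M) for local distances d.

    d[n][m] is the local distance between frame n+1 of X and frame m+1 of Y
    (0-based lists). Type I local constraints are assumed: a grid point may
    be reached by the increments (1, 0), (1, 1) or (0, 1).
    """
    n_frames, m_frames = len(d), len(d[0])
    acc = [[math.inf] * m_frames for _ in range(n_frames)]
    # 1. Initialization: D_A(1, 1) = d(1, 1).
    acc[0][0] = d[0][0]
    # 2. Recursion: each point adds its local distance to the cheapest
    #    allowed predecessor.
    for n in range(n_frames):
        for m in range(m_frames):
            if n == 0 and m == 0:
                continue
            best_prev = math.inf
            if n > 0:
                best_prev = min(best_prev, acc[n - 1][m])
            if n > 0 and m > 0:
                best_prev = min(best_prev, acc[n - 1][m - 1])
            if m > 0:
                best_prev = min(best_prev, acc[n][m - 1])
            acc[n][m] = d[n][m] + best_prev
    # 3. Termination: D(X, Y) = D_A(N, M).
    return acc[n_frames - 1][m_frames - 1]
```

For example, with the one-dimensional sequences X = (1, 2, 3) and Y = (1, 2, 2, 3) and the absolute difference as the local distance, the accumulated distance is zero, because the repeated value in Y is absorbed by a vertical step.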

Figure  A-3  shows  an  example  of  dynamic  time  warping.  The  bottom  signal  is  the 
reference  speech,  the  left  one  is  the  target  speech,  and  the  optimal  path,  in  the  center  of  the 
plot,  is  obtained  by  the  DTW  algorithm. 
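
An optimal path such as the one plotted in the figure can be recovered by storing, for each grid point, the predecessor that gave the minimum accumulated distance, and then tracing these pointers back from (N, M) to (1, 1). The Python sketch below illustrates the idea under simplifying assumptions (single-step increments only, no slope weights, no global constraint); the name dtw_path and its steps parameter are hypothetical.

```python
import math


def dtw_path(d, steps=((1, 0), (1, 1), (0, 1))):
    """Return (accumulated distance, optimal path) for local distances d.

    The path is a list of 1-based grid points (ix, iy) from (1, 1) to (N, M).
    steps lists the allowed (dx, dy) increments; the default is type I.
    """
    n_frames, m_frames = len(d), len(d[0])
    acc = [[math.inf] * m_frames for _ in range(n_frames)]
    back = [[None] * m_frames for _ in range(n_frames)]
    acc[0][0] = d[0][0]
    for n in range(n_frames):
        for m in range(m_frames):
            for dx, dy in steps:
                pn, pm = n - dx, m - dy
                if pn < 0 or pm < 0:
                    continue  # predecessor would fall outside the grid
                candidate = acc[pn][pm] + d[n][m]
                if candidate < acc[n][m]:
                    acc[n][m] = candidate
                    back[n][m] = (pn, pm)
    if math.isinf(acc[-1][-1]):
        raise ValueError("no path satisfies the local constraints")
    # Trace the stored predecessors from the ending point to the beginning.
    path = []
    point = (n_frames - 1, m_frames - 1)
    while point is not None:
        path.append((point[0] + 1, point[1] + 1))
        point = back[point[0]][point[1]]
    path.reverse()
    return acc[-1][-1], path
```

When several predecessors tie, this sketch keeps the first one found in the order given by steps, so other equally good paths may exist.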